Name: Tan Wen Tao Bryan
Admin No: P2214449
Class: DAAA/FT/2A/01
sklearn links:
pmdarima links:
scipy links:
statsmodel links:
Forecasting is a method of examining past and current statistics movements and patterns in order to gain some insight or hints about future trends and business movements. Forecasting is looking into the future to help beneficials prepare for it accordingly.
Energy consumption is the total amount of energy required for a given process. This includes the use of electricity, gas, water, oil and biomass.
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
#statsmodels libraries EDA
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.statespace.tools import diff
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import grangercausalitytests, coint
from statsmodels.tsa.vector_ar.vecm import coint_johansen
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tsa.vector_ar.var_model import FEVD
#transform libraries (for EDA)
from sklearn.preprocessing import PowerTransformer
from scipy.signal import periodogram
#cartesion product library (use for joining dataframe columns)
from itertools import product
#time-series models
from pmdarima.arima import auto_arima, ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.statespace.varmax import VARMAX
#cross validation
from sklearn.model_selection import TimeSeriesSplit
#evaluation metrics
from sklearn.metrics import mean_absolute_percentage_error, r2_score, mean_squared_error
#import energyConsumption_dataset csv file as a dataframe
energyConsumption_df = pd.read_csv('./ST1511_CA2_dataset/Energy Consumption Dataset.csv', sep=",")
display(energyConsumption_df)
| DATE | Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|---|
| 0 | 1/1/1990 | 18.0 | 725.1 | 548.8 |
| 1 | 1/2/1990 | 15.8 | 706.7 | 640.7 |
| 2 | 1/3/1990 | 17.3 | 624.5 | 511.1 |
| 3 | 1/4/1990 | 18.9 | 574.7 | 515.3 |
| 4 | 1/5/1990 | 22.0 | 553.2 | 488.4 |
| ... | ... | ... | ... | ... |
| 392 | 1/9/2022 | 27.7 | 986.2 | 513.3 |
| 393 | 1/10/2022 | 31.8 | 936.1 | 373.1 |
| 394 | 1/11/2022 | 31.0 | 973.4 | 343.9 |
| 395 | 1/12/2022 | 32.4 | 1147.2 | 348.3 |
| 396 | 1/1/2023 | 31.3 | 1294.0 | 260.2 |
397 rows × 4 columns
Observations:
Nature of the dataset: Contains 397 rows and 4 columns
#Make a copy to prevent mutation
energyConsumption_ds = energyConsumption_df.copy()
#Shape of dataset
print(energyConsumption_ds.shape)
(397, 4)
print(energyConsumption_ds.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 397 entries, 0 to 396 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 DATE 397 non-null object 1 Gas Consumption (tons) 397 non-null float64 2 Electricity Consumption (MWh) 397 non-null float64 3 Water Consumption (tons) 397 non-null float64 dtypes: float64(3), object(1) memory usage: 12.5+ KB None
Observations:
#Convert date to date type by formatting in the appropriate Date format
energyConsumption_ds["DATE"] = pd.to_datetime(energyConsumption_ds["DATE"], format='%d/%m/%Y')
energyConsumption_ds.set_index('DATE', inplace=True)
display(energyConsumption_ds.head())
| Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|
| DATE | |||
| 1990-01-01 | 18.0 | 725.1 | 548.8 |
| 1990-02-01 | 15.8 | 706.7 | 640.7 |
| 1990-03-01 | 17.3 | 624.5 | 511.1 |
| 1990-04-01 | 18.9 | 574.7 | 515.3 |
| 1990-05-01 | 22.0 | 553.2 | 488.4 |
#Descriptive Stats
energy_stats=energyConsumption_ds.describe(include="all").T
display(energy_stats)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Gas Consumption (tons) | 397.0 | 23.785139 | 4.903452 | 11.6 | 20.2 | 23.5 | 27.9 | 46.0 |
| Electricity Consumption (MWh) | 397.0 | 888.472544 | 153.877594 | 553.2 | 771.1 | 897.8 | 1005.2 | 1294.0 |
| Water Consumption (tons) | 397.0 | 484.953652 | 133.908863 | 44.4 | 384.4 | 487.4 | 580.2 | 811.0 |
Observations:
#Unique years that the dataset consist of
np.unique(energyConsumption_ds.index.year)
array([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022,
2023], dtype=int64)
#Visualise each column as a time series plot
fig = plt.figure(figsize=(18,13), tight_layout=True)
for i, column in enumerate(energyConsumption_ds.columns):
ax = fig.add_subplot(len(energyConsumption_ds.columns), 1, i+1)
energyConsumption_ds[column].plot(ax=ax)
ax.set_title(f"Time Series Visualisation of {column}", fontsize=16)
ax.grid(True)
plt.show()
Observations:
#Plot density based histograms to show distribution
fig, ax = plt.subplots(1, 3, figsize=(15,6))
sns.histplot(data=energyConsumption_ds, x='Gas Consumption (tons)', kde=True, stat="density", ax=ax[0])
ax[0].set_title("Distribution of Gas Consumption")
ax[0].set_yticks(np.arange(0,0.081, 0.01))
sns.histplot(data=energyConsumption_ds, x='Electricity Consumption (MWh)', kde=True, stat="density", ax=ax[1])
ax[1].set_title("Distribution of Electricity Consumption")
sns.histplot(data=energyConsumption_ds, x='Water Consumption (tons)', kde=True, stat="density", ax=ax[2])
ax[2].set_title("Distribution of Water Consumption")
plt.tight_layout()
plt.show()
#Boxplots to show median and distribution of data
fig, ax = plt.subplots(1, 3, figsize=(11,6))
sns.boxplot(data=energyConsumption_ds, y='Gas Consumption (tons)', ax=ax[0])
ax[0].set_title("Distribution of Gas Consumption")
sns.boxplot(data=energyConsumption_ds, y='Electricity Consumption (MWh)', ax=ax[1])
ax[1].set_title("Distribution of Electricity Consumption")
sns.boxplot(data=energyConsumption_ds, y='Water Consumption (tons)', ax=ax[2])
ax[2].set_title("Distribution of Water Consumption")
plt.tight_layout()
plt.show()
Observations:
# Split each column into two groups and compare whether the mean and variance of each group matches
for i in energyConsumption_ds.columns:
X1 = energyConsumption_ds[i].iloc[:len(energyConsumption_ds[i])//2]
X2 = energyConsumption_ds[i].iloc[len(energyConsumption_ds[i])//2:]
print(f'For {i}, \ngroup 1 mean = {X1.mean():.2f}, group 2 mean = {X2.mean():.2f}')
print(f'group 1 variance = {X1.var():.2f}, group 2 variance = {X2.var():.2f}')
print()
For Gas Consumption (tons), group 1 mean = 24.28, group 2 mean = 23.29 group 1 variance = 24.48, group 2 variance = 23.23 For Electricity Consumption (MWh), group 1 mean = 774.04, group 2 mean = 1002.33 group 1 variance = 12208.96, group 2 variance = 9083.62 For Water Consumption (tons), group 1 mean = 426.87, group 2 mean = 542.75 group 1 variance = 13118.77, group 2 variance = 16079.41
Time series is being represented by either a sum or a product of three components - (Trend, Seasonality & Random Component)
# Perform seasonal decomposition
for i in energyConsumption_ds.columns:
print(f'Column: {i}')
plt.rc("figure",figsize=(18,10))
decomposition = seasonal_decompose(energyConsumption_ds[i])
decomposition.plot()
plt.show()
Column: Gas Consumption (tons)
Column: Electricity Consumption (MWh)
Column: Water Consumption (tons)
Observations:
A stationary series is one where the properties of mean, variance and covariance do not vary with time, whereas a non-stationary series has properites which vary with time. ADF test determines how strongly a time series is defined by a trend. It can be determined if the series is stationary or not.
#ADF test
def adf_test(time_series, significant_value=0.05, name="", verbose=False):
unit_root=adfuller(time_series, autolag="BIC")
test_statistic = round(unit_root[0],4)
p_value = round(unit_root[1],4)
n_lags = round(unit_root[2], 4)
#Print summary
print(f'ADF Test on {name}')
print('-'*50)
print(f'Significance level = {significant_value}')
print(f'Test Statistic = {test_statistic}')
print(f'No of Lags = {n_lags}')
#Critical values for each percentage
for key, val in unit_root[4].items():
print(f"Critical Value ({key})= {round(val, 3)}")
#Find out if p-value is smaller than or larger than the significance value
if p_value<=significant_value:
print(f'P-Value = {p_value} Null hypothesis is rejected.')
print(f'The series is stationary.')
else:
print(f'P-Value = {p_value} Insufficient evidence to reject the null hypothesis.')
print(f'The series is non-stationary.')
#Initiating functions
for column in energyConsumption_ds.columns:
adf_test(energyConsumption_ds[column], name=column)
print()
ADF Test on Gas Consumption (tons) -------------------------------------------------- Significance level = 0.05 Test Statistic = -6.5666 No of Lags = 1 Critical Value (1%)= -3.447 Critical Value (5%)= -2.869 Critical Value (10%)= -2.571 P-Value = 0.0 Null hypothesis is rejected. The series is stationary. ADF Test on Electricity Consumption (MWh) -------------------------------------------------- Significance level = 0.05 Test Statistic = -2.0912 No of Lags = 12 Critical Value (1%)= -3.447 Critical Value (5%)= -2.869 Critical Value (10%)= -2.571 P-Value = 0.2481 Insufficient evidence to reject the null hypothesis. The series is non-stationary. ADF Test on Water Consumption (tons) -------------------------------------------------- Significance level = 0.05 Test Statistic = -7.0564 No of Lags = 1 Critical Value (1%)= -3.447 Critical Value (5%)= -2.869 Critical Value (10%)= -2.571 P-Value = 0.0 Null hypothesis is rejected. The series is stationary.
Observations:
KPSS test is a type of unit root test that tests for the stationarity of a given series around a deterministic trend.
#KPSS test
warnings.filterwarnings('ignore')
def kpss_test(time_series, significant_value=0.05, name="", verbose=False):
unit_root=kpss(time_series)
test_statistic = round(unit_root[0],4)
p_value = round(unit_root[1],4)
n_lags = round(unit_root[2], 4)
#Print summary
print(f'KPSS Test on {name}')
print('-'*50)
print(f'Significance level = {significant_value}')
print(f'Test Statistic = {test_statistic}')
print(f'No of Lags = {n_lags}')
#Critical values for each percentage
for key, val in unit_root[3].items():
print(f"Critical Value ({key})= {round(val, 3)}")
#Find out if p-value is smaller than or larger than the significance value
if p_value<=significant_value:
print(f'P-Value = {p_value} Null hypothesis is rejected.')
print(f'The series is non-stationary.')
else:
print(f'P-Value = {p_value} Insufficient evidence to reject the null hypothesis.')
print(f'The series is trend stationary.')
#Initiating functions
for column in energyConsumption_ds.columns:
kpss_test(energyConsumption_ds[column], name=column)
print()
KPSS Test on Gas Consumption (tons) -------------------------------------------------- Significance level = 0.05 Test Statistic = 0.3402 No of Lags = 10 Critical Value (10%)= 0.347 Critical Value (5%)= 0.463 Critical Value (2.5%)= 0.574 Critical Value (1%)= 0.739 P-Value = 0.1 Insufficient evidence to reject the null hypothesis. The series is trend stationary. KPSS Test on Electricity Consumption (MWh) -------------------------------------------------- Significance level = 0.05 Test Statistic = 3.5316 No of Lags = 10 Critical Value (10%)= 0.347 Critical Value (5%)= 0.463 Critical Value (2.5%)= 0.574 Critical Value (1%)= 0.739 P-Value = 0.01 Null hypothesis is rejected. The series is non-stationary. KPSS Test on Water Consumption (tons) -------------------------------------------------- Significance level = 0.05 Test Statistic = 0.8388 No of Lags = 10 Critical Value (10%)= 0.347 Critical Value (5%)= 0.463 Critical Value (2.5%)= 0.574 Critical Value (1%)= 0.739 P-Value = 0.01 Null hypothesis is rejected. The series is non-stationary.
Observations:
# def transform_box_cox(data):
# transformer=PowerTransformer(method='box-cox')
# transformed_data = transformer.fit_transform(data)
# return transformed_data
# columnsToBeTransformed = [["Gas Consumption (tons)", "Electricity Consumption (MWh)", "Water Consumption (tons)"]]
# for i in columnsToBeTransformed:
# energyConsumption_ds[i]=transform_box_cox(energyConsumption_ds[i])
# display(energyConsumption_ds.head())
#Perform differencing on Water Consumption (order=1)
water_diff=diff(energyConsumption_ds["Water Consumption (tons)"], k_diff=1)
adf_result1=adfuller(energyConsumption_ds["Water Consumption (tons)"], autolag="BIC")
adf_result2=adfuller(water_diff, autolag="BIC")
kpss_result1=kpss(energyConsumption_ds["Water Consumption (tons)"])
kpss_result2=kpss(water_diff)
print('ADF-test (p-value before differencing): %f' % adf_result1[1])
print('ADF-test (p-value after differencing): %f' % adf_result2[1])
print('KPSS-test (p-value before differencing): %f' % kpss_result1[1])
print('KPSS-test (p-value after differencing): %f' % kpss_result2[1])
fig, ax = plt.subplots(1, 2, figsize=(14, 4))
energyConsumption_ds["Water Consumption (tons)"].plot(ax=ax[0])
ax[0].grid(True)
water_diff.plot(ax=ax[1])
ax[1].grid(True)
ax[0].set_title('Time Series Before Differencing')
ax[1].set_title('Time Series After Differencing')
fig.suptitle("Water Consumption (tons)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
ADF-test (p-value before differencing): 0.000000 ADF-test (p-value after differencing): 0.000000 KPSS-test (p-value before differencing): 0.010000 KPSS-test (p-value after differencing): 0.100000
Observations:
#Perform differencing on Electricity Consumption (order=1)
electricity_diff=diff(energyConsumption_ds["Electricity Consumption (MWh)"], k_diff=1)
adf_result1=adfuller(energyConsumption_ds["Electricity Consumption (MWh)"], autolag="BIC")
adf_result2=adfuller(electricity_diff, autolag="BIC")
kpss_result1=kpss(energyConsumption_ds["Electricity Consumption (MWh)"])
kpss_result2=kpss(electricity_diff)
print('ADF-test (p-value before differencing): %f' % adf_result1[1])
print('ADF-test (p-value after differencing): %f' % adf_result2[1])
print('KPSS-test (p-value before differencing): %f' % kpss_result1[1])
print('KPSS-test (p-value after differencing): %f' % kpss_result2[1])
fig, ax = plt.subplots(1, 2, figsize=(14, 4))
energyConsumption_ds["Electricity Consumption (MWh)"].plot(ax=ax[0])
ax[0].grid(True)
electricity_diff.plot(ax=ax[1])
ax[1].grid(True)
ax[0].set_title('Time Series Before Differencing')
ax[1].set_title('Time Series After Differencing')
fig.suptitle("Electricity Consumption (kWh)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
ADF-test (p-value before differencing): 0.248052 ADF-test (p-value after differencing): 0.000000 KPSS-test (p-value before differencing): 0.010000 KPSS-test (p-value after differencing): 0.100000
Observations:
#Reassign value in the dataset to the differenced value and fill in missing value with 0
energyConsumption_ds["Water Consumption (tons)"]=water_diff
energyConsumption_ds["Electricity Consumption (MWh)"]=electricity_diff
energyConsumption_ds=energyConsumption_ds.fillna(0)
display(energyConsumption_ds.head())
| Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|
| DATE | |||
| 1990-01-01 | 18.0 | 0.0 | 0.0 |
| 1990-02-01 | 15.8 | -18.4 | 91.9 |
| 1990-03-01 | 17.3 | -82.2 | -129.6 |
| 1990-04-01 | 18.9 | -49.8 | 4.2 |
| 1990-05-01 | 22.0 | -21.5 | -26.9 |
Autocorrelation is the correlation of a time series and its lagged version over time.
#Plot acf and pacf graphs
for i in energyConsumption_ds.columns:
fig,ax=plt.subplots(1,2, tight_layout=True)
plot_acf(energyConsumption_ds[i], ax=ax[0])
plot_pacf(energyConsumption_ds[i], ax=ax[1])
fig.suptitle(i, fontsize=14, fontweight="bold")
plt.show()
Observations:
#Perform differencing on Gas Consumption (order=1)
gas_diff=diff(energyConsumption_ds["Gas Consumption (tons)"], k_diff=1)
adf_result1=adfuller(energyConsumption_ds["Gas Consumption (tons)"], autolag="BIC")
adf_result2=adfuller(gas_diff, autolag="BIC")
kpss_result1=kpss(energyConsumption_ds["Gas Consumption (tons)"])
kpss_result2=kpss(gas_diff)
print('ADF-test (p-value before differencing): %f' % adf_result1[1])
print('ADF-test (p-value after differencing): %f' % adf_result2[1])
print('KPSS-test (p-value before differencing): %f' % kpss_result1[1])
print('KPSS-test (p-value after differencing): %f' % kpss_result2[1])
fig, ax = plt.subplots(1, 2, figsize=(14, 4))
energyConsumption_ds["Gas Consumption (tons)"].plot(ax=ax[0])
ax[0].grid(True)
gas_diff.plot(ax=ax[1])
ax[1].grid(True)
ax[0].set_title('Time Series Before Differencing')
ax[1].set_title('Time Series After Differencing')
fig.suptitle("Gas Consumption (tons)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
ADF-test (p-value before differencing): 0.000000 ADF-test (p-value after differencing): 0.000000 KPSS-test (p-value before differencing): 0.100000 KPSS-test (p-value after differencing): 0.100000
#Reassign value in the dataset to the differenced value and fill in missing value with 0
energyConsumption_ds["Gas Consumption (tons)"]=gas_diff
energyConsumption_ds=energyConsumption_ds.fillna(0)
display(energyConsumption_ds.head())
| Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|
| DATE | |||
| 1990-01-01 | 0.0 | 0.0 | 0.0 |
| 1990-02-01 | -2.2 | -18.4 | 91.9 |
| 1990-03-01 | 1.5 | -82.2 | -129.6 |
| 1990-04-01 | 1.6 | -49.8 | 4.2 |
| 1990-05-01 | 3.1 | -21.5 | -26.9 |
#Plot acf & pacf plot for Gas Consumption
fig,ax=plt.subplots(1,2, tight_layout=True)
plot_acf(energyConsumption_ds["Gas Consumption (tons)"], ax=ax[0])
plot_pacf(energyConsumption_ds["Gas Consumption (tons)"], ax=ax[1])
fig.suptitle("Gas Consumption (tons)", fontsize=14, fontweight="bold")
plt.show()
Observations:
#Re-assign the original values for Electricity Consumption
energyConsumption_ds["Electricity Consumption (MWh)"]=energyConsumption_df["Electricity Consumption (MWh)"].values
display(energyConsumption_ds.head())
| Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|
| DATE | |||
| 1990-01-01 | 0.0 | 725.1 | 0.0 |
| 1990-02-01 | -2.2 | 706.7 | 91.9 |
| 1990-03-01 | 1.5 | 624.5 | -129.6 |
| 1990-04-01 | 1.6 | 574.7 | 4.2 |
| 1990-05-01 | 3.1 | 553.2 | -26.9 |
#Plot Periodogram
freq, power = periodogram(energyConsumption_df["Electricity Consumption (MWh)"])
plt.figure(figsize=(8, 6))
period = 1/freq
plt.plot(period, power, color='indianred')
plt.xlim(0, 15)
plt.xticks(np.arange(0, 15, 2))
plt.xlabel('Period')
plt.ylabel('Power Spectral Density')
plt.title('Periodogram of Electricity Consumption (Mwh)')
plt.show()
Observations:
#Re-assign the original values for Electricity Consumption
energyConsumption_ds["Electricity Consumption (MWh)"]=energyConsumption_df["Electricity Consumption (MWh)"].values
display(energyConsumption_ds.head())
| Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|
| DATE | |||
| 1990-01-01 | 0.0 | 725.1 | 0.0 |
| 1990-02-01 | -2.2 | 706.7 | 91.9 |
| 1990-03-01 | 1.5 | 624.5 | -129.6 |
| 1990-04-01 | 1.6 | 574.7 | 4.2 |
| 1990-05-01 | 3.1 | 553.2 | -26.9 |
#Calculate the seasonal period with the lowest standard deviation
S = np.arange(1, 13)
season_std = []
for season in S:
data = energyConsumption_ds["Electricity Consumption (MWh)"]
seasonality = data.diff(periods=season).fillna(0)
std = seasonality.std()
season_std.append(std)
#Plot a graph (Seasonal Period Vs Standard Deviation)
plt.figure(figsize=(8,6))
ax=plt.axes()
ax=sns.lineplot(x=S, y=season_std, marker="o", color="#316FBE")
#Get the lowest standard deviation
lowest_std = min(season_std)
plt.axhline(y=lowest_std, color="#b58900", linestyle="--")
plt.yticks(np.arange(30, 151, 10))
plt.xlabel("Seasonal Period")
plt.ylabel("Standard Deviation")
plt.title("Seasonal Period Vs Standard Deviation", fontweight="bold", fontsize=14)
plt.show()
Observations:
#Perform differencing on Electricity Consumption (order=1)
electricity_diff=diff(energyConsumption_ds["Electricity Consumption (MWh)"],
k_diff=1, k_seasonal_diff=1, seasonal_periods=6)
adf_result1=adfuller(energyConsumption_ds["Electricity Consumption (MWh)"], autolag="BIC")
adf_result2=adfuller(electricity_diff, autolag="BIC")
kpss_result1=kpss(energyConsumption_ds["Electricity Consumption (MWh)"])
kpss_result2=kpss(electricity_diff)
print('ADF-test (p-value before differencing): %f' % adf_result1[1])
print('ADF-test (p-value after differencing): %f' % adf_result2[1])
print('KPSS-test (p-value before differencing): %f' % kpss_result1[1])
print('KPSS-test (p-value after differencing): %f' % kpss_result2[1])
fig, ax = plt.subplots(1, 2, figsize=(14, 4))
energyConsumption_ds["Electricity Consumption (MWh)"].plot(ax=ax[0])
ax[0].grid(True)
electricity_diff.plot(ax=ax[1])
ax[1].grid(True)
ax[0].set_title('Time Series Before Differencing')
ax[1].set_title('Time Series After Differencing')
fig.suptitle("Electricity Consumption (kWh)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
ADF-test (p-value before differencing): 0.248052 ADF-test (p-value after differencing): 0.000000 KPSS-test (p-value before differencing): 0.010000 KPSS-test (p-value after differencing): 0.100000
#Reassign value in the dataset to the differenced value and fill in missing value with 0
energyConsumption_ds["Electricity Consumption (MWh)"]=electricity_diff
energyConsumption_ds=energyConsumption_ds.fillna(0)
display(energyConsumption_ds.head())
| Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|
| DATE | |||
| 1990-01-01 | 0.0 | 0.0 | 0.0 |
| 1990-02-01 | -2.2 | 0.0 | 91.9 |
| 1990-03-01 | 1.5 | 0.0 | -129.6 |
| 1990-04-01 | 1.6 | 0.0 | 4.2 |
| 1990-05-01 | 3.1 | 0.0 | -26.9 |
#Plot acf & pacf for Electricity Consumption
fig,ax=plt.subplots(1,2, tight_layout=True)
plot_acf(energyConsumption_ds["Electricity Consumption (MWh)"], ax=ax[0])
plot_pacf(energyConsumption_ds["Electricity Consumption (MWh)"], ax=ax[1])
fig.suptitle("Electricity Consumption (MWh)", fontsize=14, fontweight="bold")
plt.show()
Observations:
#Re-assign the original values for Electricity Consumption
energyConsumption_ds["Electricity Consumption (MWh)"]=energyConsumption_df["Electricity Consumption (MWh)"].values
display(energyConsumption_ds.head())
| Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|
| DATE | |||
| 1990-01-01 | 0.0 | 725.1 | 0.0 |
| 1990-02-01 | -2.2 | 706.7 | 91.9 |
| 1990-03-01 | 1.5 | 624.5 | -129.6 |
| 1990-04-01 | 1.6 | 574.7 | 4.2 |
| 1990-05-01 | 3.1 | 553.2 | -26.9 |
#Perform differencing on Electricity Consumption (order=1)
electricity_diff=diff(energyConsumption_ds["Electricity Consumption (MWh)"],
k_diff=1, k_seasonal_diff=1, seasonal_periods=12)
adf_result1=adfuller(energyConsumption_ds["Electricity Consumption (MWh)"], autolag="BIC")
adf_result2=adfuller(electricity_diff, autolag="BIC")
kpss_result1=kpss(energyConsumption_ds["Electricity Consumption (MWh)"])
kpss_result2=kpss(electricity_diff)
print('ADF-test (p-value before differencing): %f' % adf_result1[1])
print('ADF-test (p-value after differencing): %f' % adf_result2[1])
print('KPSS-test (p-value before differencing): %f' % kpss_result1[1])
print('KPSS-test (p-value after differencing): %f' % kpss_result2[1])
fig, ax = plt.subplots(1, 2, figsize=(14, 4))
energyConsumption_ds["Electricity Consumption (MWh)"].plot(ax=ax[0])
ax[0].grid(True)
electricity_diff.plot(ax=ax[1])
ax[1].grid(True)
ax[0].set_title('Time Series Before Differencing')
ax[1].set_title('Time Series After Differencing')
fig.suptitle("Electricity Consumption (kWh)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
ADF-test (p-value before differencing): 0.248052 ADF-test (p-value after differencing): 0.000000 KPSS-test (p-value before differencing): 0.010000 KPSS-test (p-value after differencing): 0.100000
#Reassign value in the dataset to the differenced value and fill in missing value with 0
energyConsumption_ds["Electricity Consumption (MWh)"]=electricity_diff
energyConsumption_ds=energyConsumption_ds.fillna(0)
display(energyConsumption_ds.head())
| Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|
| DATE | |||
| 1990-01-01 | 0.0 | 0.0 | 0.0 |
| 1990-02-01 | -2.2 | 0.0 | 91.9 |
| 1990-03-01 | 1.5 | 0.0 | -129.6 |
| 1990-04-01 | 1.6 | 0.0 | 4.2 |
| 1990-05-01 | 3.1 | 0.0 | -26.9 |
#Plot acf & pacf for Electricity Consumption
fig,ax=plt.subplots(1,2, tight_layout=True)
plot_acf(energyConsumption_ds["Electricity Consumption (MWh)"], ax=ax[0])
plot_pacf(energyConsumption_ds["Electricity Consumption (MWh)"], ax=ax[1])
fig.suptitle("Electricity Consumption (MWh)", fontsize=14, fontweight="bold")
plt.show()
Observations:
#Plot acf and pacf graphs
for i in energyConsumption_ds.columns:
fig,ax=plt.subplots(1,2, tight_layout=True)
plot_acf(energyConsumption_ds[i], ax=ax[0])
plot_pacf(energyConsumption_ds[i], ax=ax[1])
fig.suptitle(i, fontsize=14, fontweight="bold")
plt.show()
Observations:
Order of (p,d,q) and (P,D,Q,S)
Grangers Causality Test determines whether one time series is useful in forecasting another. Focuses on a rather short term causality and we can consider a Multivariate Model for time series
#Function for Grangers Causality Test
def grangers_causality_matrix(data, variables, maxlag=13, test = 'ssr_chi2test', verbose=False):
dataset = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
for c in dataset.columns:
for r in dataset.index:
test_result = grangercausalitytests(data[[r,c]], maxlag=maxlag, verbose=False)
p_values = [round(test_result[i+1][0][test][1],4) for i in range(maxlag)]
if verbose:
print(f'Y = {r}, X = {c}, P Values = {p_values}')
min_p_value = np.min(p_values)
dataset.loc[r,c] = min_p_value
dataset.columns = [var + '_x' for var in variables]
dataset.index = [var + '_y' for var in variables]
return dataset
granger_df = grangers_causality_matrix(energyConsumption_ds, variables = energyConsumption_ds.columns)
#Print out significant pairs that show causality
pairs = granger_df.unstack().sort_values(kind="quicksort")
mask = pairs < 0.05
print("Significant Pairs:")
print(pairs[mask])
plt.figure(figsize=(8,6))
sns.heatmap(granger_df, annot=True, cmap='seismic', vmin=0)
plt.title("Grangers Causality Test")
plt.show()
Significant Pairs: Gas Consumption (tons)_x Water Consumption (tons)_y 0.0052 dtype: float64
Observations:
Cointegration refers to long-term equilibrium relationship between non-stationary variables, means they move together in the long run despite having individual trends.
Limitation: Assumes that the cointegrating relationship is linear and there is only one linear relationship among the variables is being tested.
#Function for Engle-Granger two-step cointegration test
def EngleGranger_coint_test(dataframe, critical_level = 0.05):
n = dataframe.shape[1]
pvalue_matrix = np.ones((n, n))
keys = dataframe.columns
pairs = []
for i in range(n):
for j in range(i+1, n):
series1 = dataframe[keys[i]]
series2 = dataframe[keys[j]]
result = coint(series1, series2)
pvalue = result[1]
pvalue_matrix[i, j] = pvalue
if pvalue < critical_level:
pairs.append((keys[i], keys[j], pvalue))
return pvalue_matrix, pairs
#Create dataframe to store the matrix
pvalue_matrix, pairs = EngleGranger_coint_test(energyConsumption_ds)
coint_pvalue_matrix_df = pd.DataFrame(pvalue_matrix)
for pair in pairs:
print(pair)
plt.figure(figsize=(8,6))
sns.heatmap(coint_pvalue_matrix_df, xticklabels=energyConsumption_ds.columns,
yticklabels=energyConsumption_ds.columns, annot=True, cmap='seismic')
plt.title('Engle-Granger Cointegration Test')
plt.show()
('Gas Consumption (tons)', 'Electricity Consumption (MWh)', 7.310878998998921e-11)
('Gas Consumption (tons)', 'Water Consumption (tons)', 1.2408872926483798e-16)
('Electricity Consumption (MWh)', 'Water Consumption (tons)', 1.356676425276096e-11)
Observations:
An extension of Engle-Granger two-step cointegration test and can handle cases when there are more than two stationary variables involved.
coint_johansen = coint_johansen(energyConsumption_ds, 1, 1)
critical_vals = {"0.90": 0, "0.95": 1, "0.99": 2}
#Trace statistic
trace_stat = coint_johansen.lr1
cvt = coint_johansen.cvt[:, critical_vals[str(1 - 0.05)]]
# Summary
print("Column : Test Statistic > CI(95%) => Significant")
for col, trace, ci in zip(energyConsumption_ds.columns, trace_stat, cvt):
print(col, ":", round(trace, 2), ">", ci, " => ", trace > ci)
Column : Test Statistic > CI(95%) => Significant Gas Consumption (tons) : 813.96 > 35.0116 => True Electricity Consumption (MWh) : 510.52 > 18.3985 => True Water Consumption (tons) : 230.69 > 3.8415 => True
Observations:
#Convert date to date type by formatting in the appropriate Date format
energyConsumption_df["DATE"] = pd.to_datetime(energyConsumption_df["DATE"], format='%d/%m/%Y')
energyConsumption_df.set_index('DATE', inplace=True)
display(energyConsumption_df.head())
| Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|
| DATE | |||
| 1990-01-01 | 18.0 | 725.1 | 548.8 |
| 1990-02-01 | 15.8 | 706.7 | 640.7 |
| 1990-03-01 | 17.3 | 624.5 | 511.1 |
| 1990-04-01 | 18.9 | 574.7 | 515.3 |
| 1990-05-01 | 22.0 | 553.2 | 488.4 |
#Reason why this date was chosen to split: 80% data before this date & 20% data during and after this date
split_date = '2016-06-01'
train_set = energyConsumption_df[energyConsumption_df.index < split_date]
test_set = energyConsumption_df[energyConsumption_df.index >= split_date]
print(f"train.shape = {train_set.shape}, test.shape = {test_set.shape}")
print(type(train_set.index))
train.shape = (317, 3), test.shape = (80, 3) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
#Visualise train and test time series in a time series plot
fig = plt.figure(figsize=(18,13), tight_layout=True)
for i, column in enumerate(energyConsumption_ds.columns):
ax = fig.add_subplot(len(energyConsumption_ds.columns), 1, i+1)
train_set[column].plot(ax=ax, color="dodgerblue")
test_set[column].plot(ax=ax, color="green")
ax.set_title(f"Time Series Visualisation of {column}", fontsize=16)
ax.grid(True)
plt.show()
The following Time-Series Models are used:
A few methods I used to evaluate how each model would handle each column:
Image Source: Miro MediumI will be using this metrics to see which is the best order of parameters. I will just check the summary to see how well the ARIMA model do with the order Auto ARIMA deem the best.
Entails a calculation of maximum log-likelihood and a higher penalty term for a higher number of parameters $$ BIC_i = -2logL_i+p_ilogn $$
$L$: Max likelihood - parameter with the highest probability of correctly representing the relationship between input & output
#Stepwise set to True to see the BIC for every parameter searched
arima_modelGas=auto_arima(train_set["Gas Consumption (tons)"], start_p=1, start_q=1,
test='adf', max_p=5, max_q=5, m=1, information_criterion='bic',
seasonal=False, start_P=0, D=None, trace=True,
error_action="ignore", suppress_warnings=True, stepwise=True)
print(arima_modelGas)
Performing stepwise search to minimize bic ARIMA(1,1,1)(0,0,0)[0] intercept : BIC=1726.744, Time=0.18 sec ARIMA(0,1,0)(0,0,0)[0] intercept : BIC=1794.151, Time=0.05 sec ARIMA(1,1,0)(0,0,0)[0] intercept : BIC=1763.273, Time=0.06 sec ARIMA(0,1,1)(0,0,0)[0] intercept : BIC=1745.416, Time=0.07 sec ARIMA(0,1,0)(0,0,0)[0] : BIC=1788.396, Time=0.03 sec ARIMA(2,1,1)(0,0,0)[0] intercept : BIC=1730.887, Time=0.43 sec ARIMA(1,1,2)(0,0,0)[0] intercept : BIC=1731.259, Time=0.21 sec ARIMA(0,1,2)(0,0,0)[0] intercept : BIC=1736.710, Time=0.13 sec ARIMA(2,1,0)(0,0,0)[0] intercept : BIC=1762.028, Time=0.10 sec ARIMA(2,1,2)(0,0,0)[0] intercept : BIC=1735.755, Time=0.24 sec ARIMA(1,1,1)(0,0,0)[0] : BIC=1720.989, Time=0.07 sec ARIMA(0,1,1)(0,0,0)[0] : BIC=1739.661, Time=0.04 sec ARIMA(1,1,0)(0,0,0)[0] : BIC=1757.519, Time=0.04 sec ARIMA(2,1,1)(0,0,0)[0] : BIC=1725.131, Time=0.22 sec ARIMA(1,1,2)(0,0,0)[0] : BIC=1725.504, Time=0.09 sec ARIMA(0,1,2)(0,0,0)[0] : BIC=1730.958, Time=0.05 sec ARIMA(2,1,0)(0,0,0)[0] : BIC=1756.274, Time=0.04 sec ARIMA(2,1,2)(0,0,0)[0] : BIC=1729.999, Time=0.12 sec Best model: ARIMA(1,1,1)(0,0,0)[0] Total fit time: 2.192 seconds ARIMA(1,1,1)(0,0,0)[0]
arima_modelGas.summary()
| Dep. Variable: | y | No. Observations: | 317 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 1) | Log Likelihood | -851.861 |
| Date: | Thu, 10 Aug 2023 | AIC | 1709.721 |
| Time: | 07:26:44 | BIC | 1720.989 |
| Sample: | 01-01-1990 | HQIC | 1714.223 |
| - 05-01-2016 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| ar.L1 | 0.4200 | 0.043 | 9.873 | 0.000 | 0.337 | 0.503 |
| ma.L1 | -0.9052 | 0.036 | -24.899 | 0.000 | -0.976 | -0.834 |
| sigma2 | 12.8153 | 0.503 | 25.479 | 0.000 | 11.829 | 13.801 |
| Ljung-Box (L1) (Q): | 0.27 | Jarque-Bera (JB): | 2007.30 |
|---|---|---|---|
| Prob(Q): | 0.60 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 2.28 | Skew: | 0.68 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 15.27 |
#Stepwise set to True to see the BIC for every parameter searched
#Does not account for seasonality
arima_modelElectricity=auto_arima(train_set["Electricity Consumption (MWh)"], start_p=1, start_q=1,
test='adf', max_p=5, max_q=5, information_criterion='bic',
seasonal=False, start_P=0, D=None, trace=True,
error_action="ignore", suppress_warnings=True, stepwise=True)
print(arima_modelElectricity)
Performing stepwise search to minimize bic ARIMA(1,0,1)(0,0,0)[0] : BIC=3533.438, Time=0.08 sec ARIMA(0,0,0)(0,0,0)[0] : BIC=5193.516, Time=0.03 sec ARIMA(1,0,0)(0,0,0)[0] : BIC=inf, Time=0.03 sec ARIMA(0,0,1)(0,0,0)[0] : BIC=inf, Time=0.08 sec ARIMA(2,0,1)(0,0,0)[0] : BIC=3538.824, Time=0.16 sec ARIMA(1,0,2)(0,0,0)[0] : BIC=3470.074, Time=0.16 sec ARIMA(0,0,2)(0,0,0)[0] : BIC=inf, Time=0.17 sec ARIMA(2,0,2)(0,0,0)[0] : BIC=3447.339, Time=0.29 sec ARIMA(3,0,2)(0,0,0)[0] : BIC=inf, Time=0.39 sec ARIMA(2,0,3)(0,0,0)[0] : BIC=3421.661, Time=0.29 sec ARIMA(1,0,3)(0,0,0)[0] : BIC=3417.635, Time=0.25 sec ARIMA(0,0,3)(0,0,0)[0] : BIC=inf, Time=0.38 sec ARIMA(1,0,4)(0,0,0)[0] : BIC=3404.362, Time=0.34 sec ARIMA(0,0,4)(0,0,0)[0] : BIC=inf, Time=0.50 sec ARIMA(2,0,4)(0,0,0)[0] : BIC=3382.828, Time=0.55 sec ARIMA(3,0,4)(0,0,0)[0] : BIC=3426.799, Time=0.67 sec ARIMA(2,0,5)(0,0,0)[0] : BIC=3342.279, Time=0.51 sec ARIMA(1,0,5)(0,0,0)[0] : BIC=3350.089, Time=0.46 sec ARIMA(3,0,5)(0,0,0)[0] : BIC=inf, Time=0.79 sec ARIMA(2,0,5)(0,0,0)[0] intercept : BIC=3456.562, Time=0.71 sec Best model: ARIMA(2,0,5)(0,0,0)[0] Total fit time: 6.840 seconds ARIMA(2,0,5)(0,0,0)[0]
#Stepwise set to True to see the BIC for every parameter searched
#Account for seasonality
arima_modelElectricity=auto_arima(train_set["Electricity Consumption (MWh)"], start_p=1, start_q=1,
test='adf', max_p=5, max_q=5, m=12, d=1, max_P=3,max_Q=3, information_criterion='bic',
seasonal=True, start_P=0, D=None, trace=True,
error_action="ignore", suppress_warnings=True, stepwise=True)
print(arima_modelElectricity)
Performing stepwise search to minimize bic ARIMA(1,1,1)(0,1,1)[12] : BIC=2772.885, Time=0.75 sec ARIMA(0,1,0)(0,1,0)[12] : BIC=2938.015, Time=0.03 sec ARIMA(1,1,0)(1,1,0)[12] : BIC=2883.919, Time=0.13 sec ARIMA(0,1,1)(0,1,1)[12] : BIC=2809.733, Time=0.31 sec ARIMA(1,1,1)(0,1,0)[12] : BIC=inf, Time=0.20 sec ARIMA(1,1,1)(1,1,1)[12] : BIC=2778.554, Time=0.96 sec ARIMA(1,1,1)(0,1,2)[12] : BIC=2778.528, Time=1.64 sec ARIMA(1,1,1)(1,1,0)[12] : BIC=2827.936, Time=0.46 sec ARIMA(1,1,1)(1,1,2)[12] : BIC=2782.610, Time=2.83 sec ARIMA(1,1,0)(0,1,1)[12] : BIC=2826.664, Time=0.28 sec ARIMA(2,1,1)(0,1,1)[12] : BIC=2776.543, Time=0.68 sec ARIMA(1,1,2)(0,1,1)[12] : BIC=2775.498, Time=0.75 sec ARIMA(0,1,0)(0,1,1)[12] : BIC=2831.958, Time=0.17 sec ARIMA(0,1,2)(0,1,1)[12] : BIC=2773.425, Time=0.50 sec ARIMA(2,1,0)(0,1,1)[12] : BIC=2805.775, Time=0.38 sec ARIMA(2,1,2)(0,1,1)[12] : BIC=2780.005, Time=1.05 sec ARIMA(1,1,1)(0,1,1)[12] intercept : BIC=inf, Time=0.68 sec Best model: ARIMA(1,1,1)(0,1,1)[12] Total fit time: 11.845 seconds ARIMA(1,1,1)(0,1,1)[12]
arima_modelElectricity.summary()
| Dep. Variable: | y | No. Observations: | 317 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 1)x(0, 1, 1, 12) | Log Likelihood | -1375.009 |
| Date: | Thu, 10 Aug 2023 | AIC | 2758.017 |
| Time: | 07:27:03 | BIC | 2772.885 |
| Sample: | 01-01-1990 | HQIC | 2763.965 |
| - 05-01-2016 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| ar.L1 | 0.5059 | 0.058 | 8.718 | 0.000 | 0.392 | 0.620 |
| ma.L1 | -0.9540 | 0.020 | -47.423 | 0.000 | -0.993 | -0.915 |
| ma.S.L12 | -0.7054 | 0.046 | -15.283 | 0.000 | -0.796 | -0.615 |
| sigma2 | 479.7243 | 30.285 | 15.840 | 0.000 | 420.366 | 539.083 |
| Ljung-Box (L1) (Q): | 0.33 | Jarque-Bera (JB): | 36.75 |
|---|---|---|---|
| Prob(Q): | 0.57 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 2.55 | Skew: | -0.24 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 4.64 |
#Stepwise set to True to see the BIC for every parameter searched
arima_modelWater=auto_arima(train_set["Water Consumption (tons)"], start_p=1, start_q=1,
test='adf', max_p=5, max_q=5, m=1, d=1, information_criterion='bic',
seasonal=False, start_P=0, D=None, trace=True,
error_action="ignore", suppress_warnings=True, stepwise=True)
print(arima_modelWater)
Performing stepwise search to minimize bic ARIMA(1,1,1)(0,0,0)[0] intercept : BIC=3854.921, Time=0.17 sec ARIMA(0,1,0)(0,0,0)[0] intercept : BIC=3930.713, Time=0.03 sec ARIMA(1,1,0)(0,0,0)[0] intercept : BIC=3899.250, Time=0.06 sec ARIMA(0,1,1)(0,0,0)[0] intercept : BIC=3869.609, Time=0.11 sec ARIMA(0,1,0)(0,0,0)[0] : BIC=3924.959, Time=0.03 sec ARIMA(2,1,1)(0,0,0)[0] intercept : BIC=3859.638, Time=0.22 sec ARIMA(1,1,2)(0,0,0)[0] intercept : BIC=3859.513, Time=0.21 sec ARIMA(0,1,2)(0,0,0)[0] intercept : BIC=3861.051, Time=0.16 sec ARIMA(2,1,0)(0,0,0)[0] intercept : BIC=3890.107, Time=0.06 sec ARIMA(2,1,2)(0,0,0)[0] intercept : BIC=3865.002, Time=0.39 sec ARIMA(1,1,1)(0,0,0)[0] : BIC=3849.228, Time=0.08 sec ARIMA(0,1,1)(0,0,0)[0] : BIC=3863.853, Time=0.05 sec ARIMA(1,1,0)(0,0,0)[0] : BIC=3893.496, Time=0.05 sec ARIMA(2,1,1)(0,0,0)[0] : BIC=3853.953, Time=0.12 sec ARIMA(1,1,2)(0,0,0)[0] : BIC=3853.831, Time=0.13 sec ARIMA(0,1,2)(0,0,0)[0] : BIC=3855.314, Time=0.07 sec ARIMA(2,1,0)(0,0,0)[0] : BIC=3884.352, Time=0.04 sec ARIMA(2,1,2)(0,0,0)[0] : BIC=3859.325, Time=0.22 sec Best model: ARIMA(1,1,1)(0,0,0)[0] Total fit time: 2.191 seconds ARIMA(1,1,1)(0,0,0)[0]
arima_modelWater.summary()
| Dep. Variable: | y | No. Observations: | 317 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 1) | Log Likelihood | -1915.981 |
| Date: | Thu, 10 Aug 2023 | AIC | 3837.961 |
| Time: | 07:27:05 | BIC | 3849.228 |
| Sample: | 01-01-1990 | HQIC | 3842.462 |
| - 05-01-2016 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| ar.L1 | 0.3952 | 0.050 | 7.891 | 0.000 | 0.297 | 0.493 |
| ma.L1 | -0.9167 | 0.026 | -34.879 | 0.000 | -0.968 | -0.865 |
| sigma2 | 1.078e+04 | 590.046 | 18.265 | 0.000 | 9620.656 | 1.19e+04 |
| Ljung-Box (L1) (Q): | 0.16 | Jarque-Bera (JB): | 108.38 |
|---|---|---|---|
| Prob(Q): | 0.69 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 1.33 | Skew: | -0.34 |
| Prob(H) (two-sided): | 0.14 | Kurtosis: | 5.79 |
Observations:
def walk_forward_validate(column, train_set, test_set, model, original_data):
#store predictions
predictions=[]
data=train_set[column].values
model_selected=model.fit(data)
#making in-sample predictions of training data
pred_in_sample = model_selected.predict_in_sample()
trainPredicted = pd.Series(pred_in_sample, index=train_set[column].index)
for t in test_set[column].values:
#predicting the next time step
y=model_selected.predict(start=len(data), end=len(data))
#appends one step ahead prediction
predictions.append(y[0])
#adds current value from test set to data array which include actual value from the history
#to predict the next iteration prediction
data = np.append(data, t)
#update ARIMA model with new observation, ensure model uses updated history for next prediction
model_selected.update(t)
model_selected=model.fit(data)
#calculate the aic & bic scores based on the order
aic=model_selected.aic()
bic=model_selected.bic()
testPredicted = pd.Series(predictions, index=test_set[column].index)
pred_full = np.concatenate((pred_in_sample, predictions))
pred_full_Energy = pd.Series(pred_full, index=original_data[column].index)
#calculate the MAPE & RMSE scores for train set and validaion set based on the order
train_MAPE = round(mean_absolute_percentage_error(train_set[column], trainPredicted)*100,3)
train_RMSE = round(mean_squared_error(train_set[column], trainPredicted, squared=False),3)
validate_MAPE = round(mean_absolute_percentage_error(test_set[column], testPredicted)*100, 3)
validate_RMSE = round(mean_squared_error(test_set[column], testPredicted, squared=False),3)
return(
trainPredicted, testPredicted, pred_full_Energy, model_selected, aic, bic,
train_MAPE, train_RMSE, validate_MAPE, validate_RMSE)
#Gas Consumption Predictions for train and test data
model = ARIMA(order=(0,1,1))
(gas_trainPredicted, gas, pred_full_Gas, arima_modelGas, gas_aic, gas_bic,
gas_train_MAPE, gas_train_RMSE, gas_validate_MAPE, gas_validate_RMSE) = walk_forward_validate(
"Gas Consumption (tons)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Gas.head())
DATE 1990-01-01 0.001821 1990-02-01 18.001685 1990-03-01 16.727637 1990-04-01 17.009442 1990-05-01 17.889356 dtype: float64
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))
#Plot training, testing & predicted values on the graph
train_set["Gas Consumption (tons)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Gas Consumption (tons)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Gas.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA for Gas Consumption (tons)", fontsize=16)
ax[0].set_yticks(np.arange(0,51,10))
ax[0].grid(True)
#Focus on the predicted values and the test values
test_set["Gas Consumption (tons)"].plot(ax=ax[1], label="testing", color='green')
gas.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].set_yticks(np.arange(0,51,10))
ax[1].grid(True)
plt.tight_layout()
plt.show()
Observations:
#print summary for Gas Consumption
arima_modelGas.summary()
| Dep. Variable: | y | No. Observations: | 397 |
|---|---|---|---|
| Model: | SARIMAX(0, 1, 1) | Log Likelihood | -1068.653 |
| Date: | Thu, 10 Aug 2023 | AIC | 2143.306 |
| Time: | 07:27:13 | BIC | 2155.250 |
| Sample: | 0 | HQIC | 2148.038 |
| - 397 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 0.0341 | 0.087 | 0.394 | 0.694 | -0.136 | 0.204 |
| ma.L1 | -0.5347 | 0.029 | -18.449 | 0.000 | -0.592 | -0.478 |
| sigma2 | 12.9165 | 0.379 | 34.070 | 0.000 | 12.173 | 13.660 |
| Ljung-Box (L1) (Q): | 4.06 | Jarque-Bera (JB): | 2599.64 |
|---|---|---|---|
| Prob(Q): | 0.04 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 2.24 | Skew: | 0.82 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 15.44 |
Observations:
#Plot diagnostic plot
arima_modelGas.plot_diagnostics(figsize=(14,10))
plt.show()
Observations:
#Electricity Consumption Predictions for train and test data
model = ARIMA(order=(2,0,5))
(electricity_trainPredicted, electricity, pred_full_Elec, arima_modelElec, elec_aic, elec_bic,
elec_train_MAPE, elec_train_RMSE, elec_validate_MAPE, elec_validate_RMSE) = walk_forward_validate(
"Electricity Consumption (MWh)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Elec.head())
DATE 1990-01-01 846.326878 1990-02-01 738.124861 1990-03-01 720.508326 1990-04-01 617.486736 1990-05-01 607.414031 dtype: float64
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))
#Plot training, testing & predicted values on the graph
train_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Elec.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA for Electricity Consumption (MWh)", fontsize=16)
ax[0].grid(True)
test_set["Electricity Consumption (MWh)"].plot(ax=ax[1], label="testing", color='green')
electricity.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()
Observations:
#print summary for Electricity Consumption
arima_modelElec.summary()
| Dep. Variable: | y | No. Observations: | 397 |
|---|---|---|---|
| Model: | SARIMAX(2, 0, 5) | Log Likelihood | -2160.346 |
| Date: | Thu, 10 Aug 2023 | AIC | 4338.692 |
| Time: | 07:28:12 | BIC | 4374.547 |
| Sample: | 0 | HQIC | 4352.895 |
| - 397 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 168.4601 | 35.767 | 4.710 | 0.000 | 98.358 | 238.562 |
| ar.L1 | 1.2754 | 0.113 | 11.303 | 0.000 | 1.054 | 1.497 |
| ar.L2 | -0.4643 | 0.105 | -4.412 | 0.000 | -0.671 | -0.258 |
| ma.L1 | -0.0179 | 0.103 | -0.173 | 0.863 | -0.221 | 0.185 |
| ma.L2 | -0.2590 | 0.051 | -5.045 | 0.000 | -0.360 | -0.158 |
| ma.L3 | -0.0672 | 0.058 | -1.152 | 0.249 | -0.182 | 0.047 |
| ma.L4 | 0.2756 | 0.053 | 5.195 | 0.000 | 0.172 | 0.380 |
| ma.L5 | 0.4211 | 0.063 | 6.641 | 0.000 | 0.297 | 0.545 |
| sigma2 | 3095.7020 | 232.999 | 13.286 | 0.000 | 2639.033 | 3552.371 |
| Ljung-Box (L1) (Q): | 4.42 | Jarque-Bera (JB): | 0.54 |
|---|---|---|---|
| Prob(Q): | 0.04 | Prob(JB): | 0.76 |
| Heteroskedasticity (H): | 2.06 | Skew: | 0.06 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 3.14 |
Observations:
#Plot diagnostic plot
arima_modelElec.plot_diagnostics(figsize=(14,10))
plt.show()
Observations:
#Water Consumption Predictions for train and test data
model = ARIMA(order=(0,1,1))
(water_trainPredicted, water, pred_full_Water, arima_modelWater, water_aic, water_bic,
water_train_MAPE, water_train_RMSE, water_validate_MAPE, water_validate_RMSE) = walk_forward_validate(
"Water Consumption (tons)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Water.head())
DATE 1990-01-01 0.002373 1990-02-01 544.919239 1990-03-01 597.541027 1990-04-01 559.785577 1990-05-01 542.052360 dtype: float64
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))
#Plot training, testing & predicted values on the graph
train_set["Water Consumption (tons)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Water Consumption (tons)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Water.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA for Water Consumption (tons)", fontsize=16)
ax[0].grid(True)
test_set["Water Consumption (tons)"].plot(ax=ax[1], label="testing", color='green')
water.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()
Observations:
#print summary for Water Consumption
arima_modelWater.summary()
| Dep. Variable: | y | No. Observations: | 397 |
|---|---|---|---|
| Model: | SARIMAX(0, 1, 1) | Log Likelihood | -2404.050 |
| Date: | Thu, 10 Aug 2023 | AIC | 4814.100 |
| Time: | 07:28:25 | BIC | 4826.044 |
| Sample: | 0 | HQIC | 4818.832 |
| - 397 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | -0.5743 | 2.335 | -0.246 | 0.806 | -5.151 | 4.002 |
| ma.L1 | -0.5644 | 0.035 | -16.018 | 0.000 | -0.633 | -0.495 |
| sigma2 | 1.097e+04 | 552.167 | 19.864 | 0.000 | 9886.232 | 1.21e+04 |
| Ljung-Box (L1) (Q): | 3.43 | Jarque-Bera (JB): | 82.06 |
|---|---|---|---|
| Prob(Q): | 0.06 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 1.60 | Skew: | -0.02 |
| Prob(H) (two-sided): | 0.01 | Kurtosis: | 5.23 |
Observations:
#Plot diagnostic plot
arima_modelWater.plot_diagnostics(figsize=(14,10))
plt.show()
Observations:
print(f'Gas Consumption:')
print(f'Train MAPE: {gas_train_MAPE}%')
print(f'Train RMSE: {gas_train_RMSE}')
print(f'Validation MAPE: {gas_validate_MAPE}%')
print(f'Validation RMSE: {gas_validate_RMSE}')
print()
print(f'Electricity Consumption:')
print(f'Train MAPE: {elec_train_MAPE}%')
print(f'Train RMSE: {elec_train_RMSE}')
print(f'Validation MAPE: {elec_validate_MAPE}%')
print(f'Validation RMSE: {elec_validate_RMSE}')
print()
print(f'Water Consumption:')
print(f'Train MAPE: {water_train_MAPE}%')
print(f'Train RMSE: {water_train_RMSE}')
print(f'Validation MAPE: {water_validate_MAPE}%')
print(f'Validation RMSE: {water_validate_RMSE}')
Gas Consumption: Train MAPE: 11.742% Train RMSE: 3.854 Validation MAPE: 9.417% Validation RMSE: 3.044 Electricity Consumption: Train MAPE: 4.84% Train RMSE: 52.197 Validation MAPE: 5.567% Validation RMSE: 71.416 Water Consumption: Train MAPE: 22.588% Train RMSE: 111.539 Validation MAPE: 17.005% Validation RMSE: 94.14
Observations:
#Electricity Consumption Predictions for train and test data
model = ARIMA(order=(2,1,3), seasonal_order=(0,1,1,12))
(electricity_trainPredicted, electricity, pred_full_Elec, arima_modelElec, elec_aic, elec_bic,
elec_train_MAPE, elec_train_RMSE, elec_validate_MAPE, elec_validate_RMSE) = walk_forward_validate(
"Electricity Consumption (MWh)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Elec.head())
DATE 1990-01-01 -0.052931 1990-02-01 724.996031 1990-03-01 706.578398 1990-04-01 624.434188 1990-05-01 574.661576 dtype: float64
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))
#Plot training, testing & predicted values on the graph
train_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Elec.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA with Seasonality for Electricity Consumption (MWh)", fontsize=16)
ax[0].grid(True)
test_set["Electricity Consumption (MWh)"].plot(ax=ax[1], label="testing", color='green')
electricity.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()
Observations:
#print summary for Electricity Consumption
arima_modelElec.summary()
| Dep. Variable: | y | No. Observations: | 397 |
|---|---|---|---|
| Model: | SARIMAX(2, 1, 3)x(0, 1, [1], 12) | Log Likelihood | -1769.128 |
| Date: | Thu, 10 Aug 2023 | AIC | 3554.256 |
| Time: | 07:32:39 | BIC | 3585.861 |
| Sample: | 0 | HQIC | 3566.792 |
| - 397 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | -0.0732 | 0.013 | -5.458 | 0.000 | -0.099 | -0.047 |
| ar.L1 | -0.5510 | 0.089 | -6.190 | 0.000 | -0.726 | -0.377 |
| ar.L2 | 0.4136 | 0.088 | 4.679 | 0.000 | 0.240 | 0.587 |
| ma.L1 | 0.1486 | 0.137 | 1.081 | 0.279 | -0.121 | 0.418 |
| ma.L2 | -0.9651 | 0.124 | -7.769 | 0.000 | -1.209 | -0.722 |
| ma.L3 | -0.1822 | 0.090 | -2.034 | 0.042 | -0.358 | -0.007 |
| ma.S.L12 | -0.7701 | 0.040 | -19.492 | 0.000 | -0.848 | -0.693 |
| sigma2 | 561.0392 | 64.649 | 8.678 | 0.000 | 434.330 | 687.749 |
| Ljung-Box (L1) (Q): | 0.06 | Jarque-Bera (JB): | 21.06 |
|---|---|---|---|
| Prob(Q): | 0.81 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 2.92 | Skew: | -0.09 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 4.13 |
Observations:
#Plot diagnostic plot
arima_modelElec.plot_diagnostics(figsize=(14,10))
plt.show()
Observations:
print(f'Electricity Consumption:')
print(f'Train MAPE: {elec_train_MAPE}%')
print(f'Train RMSE: {elec_train_RMSE}')
print(f'Validation MAPE: {elec_validate_MAPE}%')
print(f'Validation RMSE: {elec_validate_RMSE}')
print()
Electricity Consumption: Train MAPE: 2.499% Train RMSE: 50.18 Validation MAPE: 2.419% Validation RMSE: 31.64
Observations:
Now I will try to run the same model but from a different library just to see whether a different result was obtained.
#New walk forward cross validation function since a different library was used which means it has different methods
def walk_forward_validate_sarimax(column, train_set, test_set, model, order, seasonal_order):
#store predictions
trainPredictions=[]
testPredictions=[]
fullPredictions=[]
data=train_set[column].values
model_fit = model
trainPredictions = model_fit.get_prediction(start='1990-01-01', end='2016-05-01').predicted_mean
for t in test_set[column].values:
model = SARIMAX(data, order=order, seasonal_order=seasonal_order)
model_fit = model.fit()
#predicting the next time step
y_test = model_fit.forecast(steps=1)
#appends one step ahead prediction
testPredictions.append(y_test[0])
#adds current value from test set to data array which include actual value from the history
#to predict the next iteration prediction
data = np.append(data, t)
#Concatenate full predictions
fullPredictions=np.concatenate((trainPredictions, testPredictions))
#calculate the aic & bic scores based on the order
aic=model_fit.aic
bic=model_fit.bic
#convert everything to dataframe
trainPredicted = pd.Series(data=trainPredictions, index=train_set.index)
testPredicted = pd.Series(data=testPredictions, index=test_set.index)
pred_full_energy = pd.Series(data=fullPredictions, index=energyConsumption_df[column].index)
#calculate the MAPE & RMSE scores for train set and validaion set based on the order
train_MAPE = round(mean_absolute_percentage_error(train_set[column], trainPredicted)*100,3)
train_RMSE = round(mean_squared_error(train_set[column], trainPredicted, squared=False),3)
validate_MAPE = round(mean_absolute_percentage_error(test_set[column], testPredicted)*100, 3)
validate_RMSE = round(mean_squared_error(test_set[column], testPredicted, squared=False),3)
return(
trainPredicted, testPredicted, pred_full_energy, aic, bic,
train_MAPE, train_RMSE, validate_MAPE, validate_RMSE)
#Electricity Consumption Predictions for train and test data
sarimax_model_Elec = SARIMAX(train_set["Electricity Consumption (MWh)"], order=(2,1,3), seasonal_order=(0,1,1,12)).fit()
(electricity_trainPredicted, electricity, pred_full_Elec, elec_aic, elec_bic,
elec_train_MAPE, elec_train_RMSE, elec_validate_MAPE, elec_validate_RMSE) = walk_forward_validate_sarimax(
"Electricity Consumption (MWh)", train_set, test_set, sarimax_model_Elec, (2,1,3), (0,1,1,12)
)
display(pred_full_Elec.head())
DATE 1990-01-01 0.000000 1990-02-01 725.048405 1990-03-01 706.617533 1990-04-01 624.492987 1990-05-01 574.722391 dtype: float64
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))
#Plot training, testing & predicted values on the graph
train_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Elec.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"SARIMA for Electricity Consumption (MWh)", fontsize=16)
ax[0].grid(True)
test_set["Electricity Consumption (MWh)"].plot(ax=ax[1], label="testing", color='green')
electricity.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()
Observations:
print(f'Electricity Consumption:')
print(f'Train MAPE: {elec_train_MAPE}%')
print(f'Train RMSE: {elec_train_RMSE}')
print(f'Validation MAPE: {elec_validate_MAPE}%')
print(f'Validation RMSE: {elec_validate_RMSE}')
print()
Electricity Consumption: Train MAPE: 2.536% Train RMSE: 50.23 Validation MAPE: 2.436% Validation RMSE: 31.608
Observations:
#print SARIMAX summary for Electricity Consumption
sarimax_model_Elec.summary()
| Dep. Variable: | Electricity Consumption (MWh) | No. Observations: | 317 |
|---|---|---|---|
| Model: | SARIMAX(2, 1, 3)x(0, 1, [1], 12) | Log Likelihood | -1373.238 |
| Date: | Thu, 10 Aug 2023 | AIC | 2760.476 |
| Time: | 07:34:55 | BIC | 2786.495 |
| Sample: | 01-01-1990 | HQIC | 2770.884 |
| - 05-01-2016 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| ar.L1 | -0.7202 | 0.144 | -4.989 | 0.000 | -1.003 | -0.437 |
| ar.L2 | 0.2542 | 0.122 | 2.077 | 0.038 | 0.014 | 0.494 |
| ma.L1 | 0.3331 | 0.140 | 2.382 | 0.017 | 0.059 | 0.607 |
| ma.L2 | -0.8977 | 0.067 | -13.439 | 0.000 | -1.029 | -0.767 |
| ma.L3 | -0.2715 | 0.101 | -2.696 | 0.007 | -0.469 | -0.074 |
| ma.S.L12 | -0.7267 | 0.050 | -14.551 | 0.000 | -0.825 | -0.629 |
| sigma2 | 473.8149 | 29.857 | 15.869 | 0.000 | 415.296 | 532.334 |
| Ljung-Box (L1) (Q): | 0.04 | Jarque-Bera (JB): | 48.50 |
|---|---|---|---|
| Prob(Q): | 0.84 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 2.51 | Skew: | -0.32 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 4.85 |
Observations:
#Plot diagnostic plot
fig = sarimax_model_Elec.plot_diagnostics()
ax = fig.axes[0]
ax.set_xlim(250, 557)
ax.set_ylim(-6,4)
ax.axhline(0)
plt.show()
Observations:
#Try out p=1, q=1
varma_model = VARMAX(endog=train_set, order=(1,1), enforce_stationary=False, enforce_invertibility=False).fit(disp=False)
#Summary
varma_model.summary()
| Dep. Variable: | ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)'] | No. Observations: | 317 |
|---|---|---|---|
| Model: | VARMA(1,1) | Log Likelihood | -4487.343 |
| + intercept | AIC | 9028.687 | |
| Date: | Thu, 10 Aug 2023 | BIC | 9130.177 |
| Time: | 07:34:57 | HQIC | 9069.227 |
| Sample: | 01-01-1990 | ||
| - 05-01-2016 | |||
| Covariance Type: | opg |
| Ljung-Box (L1) (Q): | 0.83, 2.46, 0.82 | Jarque-Bera (JB): | 1868.22, 7.44, 75.18 |
|---|---|---|---|
| Prob(Q): | 0.36, 0.12, 0.36 | Prob(JB): | 0.00, 0.02, 0.00 |
| Heteroskedasticity (H): | 2.11, 2.45, 1.43 | Skew: | 0.67, 0.35, -0.52 |
| Prob(H) (two-sided): | 0.00, 0.00, 0.07 | Kurtosis: | 14.82, 3.28, 5.15 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 7.9276 | 2.420 | 3.275 | 0.001 | 3.184 | 12.672 |
| L1.Gas Consumption (tons) | 0.7272 | 0.075 | 9.703 | 0.000 | 0.580 | 0.874 |
| L1.Electricity Consumption (MWh) | 0.0010 | 0.001 | 0.737 | 0.461 | -0.002 | 0.004 |
| L1.Water Consumption (tons) | -0.0051 | 0.003 | -1.839 | 0.066 | -0.010 | 0.000 |
| L1.e(Gas Consumption (tons)) | -0.1175 | 0.077 | -1.532 | 0.126 | -0.268 | 0.033 |
| L1.e(Electricity Consumption (MWh)) | 0.0010 | 0.005 | 0.185 | 0.854 | -0.009 | 0.011 |
| L1.e(Water Consumption (tons)) | 0.0090 | 0.004 | 2.249 | 0.025 | 0.001 | 0.017 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 60.8961 | 88.367 | 0.689 | 0.491 | -112.299 | 234.091 |
| L1.Gas Consumption (tons) | 3.1787 | 2.506 | 1.268 | 0.205 | -1.734 | 8.091 |
| L1.Electricity Consumption (MWh) | 0.7600 | 0.072 | 10.621 | 0.000 | 0.620 | 0.900 |
| L1.Water Consumption (tons) | 0.1394 | 0.095 | 1.462 | 0.144 | -0.047 | 0.326 |
| L1.e(Gas Consumption (tons)) | -0.3897 | 2.028 | -0.192 | 0.848 | -4.365 | 3.586 |
| L1.e(Electricity Consumption (MWh)) | 0.6118 | 0.082 | 7.494 | 0.000 | 0.452 | 0.772 |
| L1.e(Water Consumption (tons)) | -0.0808 | 0.090 | -0.901 | 0.368 | -0.257 | 0.095 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 78.2474 | 59.258 | 1.320 | 0.187 | -37.895 | 194.390 |
| L1.Gas Consumption (tons) | -1.9881 | 1.891 | -1.051 | 0.293 | -5.695 | 1.718 |
| L1.Electricity Consumption (MWh) | 0.1004 | 0.043 | 2.333 | 0.020 | 0.016 | 0.185 |
| L1.Water Consumption (tons) | 0.7587 | 0.082 | 9.307 | 0.000 | 0.599 | 0.918 |
| L1.e(Gas Consumption (tons)) | 3.5529 | 2.364 | 1.503 | 0.133 | -1.081 | 8.187 |
| L1.e(Electricity Consumption (MWh)) | -0.1302 | 0.136 | -0.956 | 0.339 | -0.397 | 0.137 |
| L1.e(Water Consumption (tons)) | -0.3076 | 0.105 | -2.934 | 0.003 | -0.513 | -0.102 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| sqrt.var.Gas Consumption (tons) | 3.6720 | 0.109 | 33.842 | 0.000 | 3.459 | 3.885 |
| sqrt.cov.Gas Consumption (tons).Electricity Consumption (MWh) | 3.1604 | 6.860 | 0.461 | 0.645 | -10.284 | 16.605 |
| sqrt.var.Electricity Consumption (MWh) | 69.9175 | 4.016 | 17.408 | 0.000 | 62.045 | 77.790 |
| sqrt.cov.Gas Consumption (tons).Water Consumption (tons) | -51.8973 | 5.403 | -9.605 | 0.000 | -62.488 | -41.307 |
| sqrt.cov.Electricity Consumption (MWh).Water Consumption (tons) | 3.0364 | 6.911 | 0.439 | 0.660 | -10.509 | 16.581 |
| sqrt.var.Water Consumption (tons) | 93.9102 | 3.572 | 26.293 | 0.000 | 86.910 | 100.911 |
Observations:
#Diagnostics plot for Gas Consumption
fig=varma_model.plot_diagnostics(variable=0)
plt.gcf().suptitle('Gas Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(250, 557)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()
Observations for Gas Consumption:
#Diagnostics plot for Electricity Consumption
fig=varma_model.plot_diagnostics(variable=1)
plt.gcf().suptitle('Electricity Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(240, 555)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()
Observations for Electricity Consumption:
#Diagnostics plot for Water Consumption
fig=varma_model.plot_diagnostics(variable=2)
plt.gcf().suptitle('Water Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(240, 555)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()
Observations for Water Consumption:
# Walk forward cross validation for VARMA
def walk_forward_validate_varma(columns, train_set, test_set, model, order):
#store predictions
trainPredictions=[]
testPredictions=[]
fullPredictions=[]
data=train_set[columns].values
model_fit = model
trainPredictions = model_fit.get_prediction(start='1990-01-01', end='2016-05-01').predicted_mean
for t in test_set[columns].values:
model = VARMAX(endog=data, order=order, enforce_stationary=False, enforce_invertibility=False)
model_fit = model.fit(disp=False)
#predicting the next time step
y_test = model_fit.forecast(steps=1)
#appends one step ahead prediction
testPredictions.append(y_test[0])
#adds current value from test set to data array which include actual value from the history
#to predict the next iteration prediction
data = np.append(data, [t], axis=0)
#Concatenate full predictions
fullPredictions=np.concatenate((trainPredictions, testPredictions))
#calculate the aic & bic scores based on the order
aic=model_fit.aic
bic=model_fit.bic
#convert everything to dataframe
trainPredicted = pd.DataFrame(data=trainPredictions, columns=columns, index=train_set.index)
testPredicted = pd.DataFrame(data=testPredictions, columns=columns, index=test_set.index)
pred_full_energy = pd.DataFrame(data=fullPredictions, columns=columns, index=energyConsumption_df.index)
#calculate the MAPE & RMSE scores for train set and validaion set based on the order
train_MAPE = round(mean_absolute_percentage_error(train_set[columns], trainPredicted)*100,3)
train_RMSE = round(mean_squared_error(train_set[columns], trainPredicted, squared=False),3)
validate_MAPE = round(mean_absolute_percentage_error(test_set[columns], testPredicted)*100, 3)
validate_RMSE = round(mean_squared_error(test_set[columns], testPredicted, squared=False),3)
return(
trainPredicted, testPredicted, pred_full_energy, model, aic, bic,
train_MAPE, train_RMSE, validate_MAPE, validate_RMSE)
#VARMA Predictions for train and test data
columns = ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)']
varma_model = VARMAX(endog=train_set[columns], order=(1,1)).fit(disp=False)
(trainPredicted, testPredicted, pred_full, varma_model, aic, bic,
v_Train_MAPE, v_Train_RMSE, v_Validate_MAPE, v_validate_RMSE) = walk_forward_validate_varma(
columns, train_set, test_set, varma_model, (1,1)
)
display(pred_full.head())
| Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|
| DATE | |||
| 1990-01-01 | 23.486793 | 839.552364 | 480.699104 |
| 1990-02-01 | 19.871191 | 732.799785 | 507.893358 |
| 1990-03-01 | 18.466696 | 720.768855 | 556.041754 |
| 1990-04-01 | 18.325582 | 613.990708 | 509.973274 |
| 1990-05-01 | 19.593443 | 606.845628 | 493.557555 |
fig,ax = plt.subplots(3, 1)
fig.suptitle("VARMA Actual VS Predicted", fontsize=14, fontweight="bold")
#Plot actual & predicted values on the graph
index=0
for column in energyConsumption_df.columns:
energyConsumption_df[column].plot(ax=ax[index], label="actual", color='dodgerblue')
pred_full[column].plot(ax=ax[index], label='predicted', linestyle='--', color='orange')
ax[index].legend()
ax[index].set_title(f'{column}')
index+=1
plt.tight_layout()
plt.show()
Observations:
#Impulse-Response Function
varma_model = VARMAX(endog=train_set[columns], order=(1,1)).fit(disp=False)
gas_irfs = varma_model.impulse_responses(24, impulse=0, orthogonalized=True)
electricity_irfs = varma_model.impulse_responses(24, impulse=1, orthogonalized=True)
water_irfs = varma_model.impulse_responses(24, impulse=2, orthogonalized=True)
#Plot impulse reaction graph for better visualisation
fig, ax = plt.subplots(3, 1, figsize=(15,9))
index=0
irfs=[gas_irfs, electricity_irfs, water_irfs]
for col in columns:
ax[index].plot(range(0,25), irfs[index], marker='o', label=["Gas","Electricity","Water"])
ax[index].set_xlabel("Time Steps")
ax[index].set_ylabel("Impulse-Response Value")
ax[index].set_title(f"Impulse-Response Function - {col}")
ax[index].legend()
ax[index].grid(True)
index+=1
plt.tight_layout()
plt.show()
Observations:
Determines whether there is evidence of autocorrelation (serial correlation) in the residuals of a regression analysis. Checks for the presence of first-order autocorrelation. Test statistic take the values between 0 and 4. A value close to 2 suggests no autocorrelation (null hypothesis cannot be rejected). A value significantly less than 2 indicates positive autocorrelation (rejection of null hypothesis in favor of alternative). A value significantly greater than 2 indicates negative autocorrelation (rejection of null hypothesis in favor of alternative)
#Conduct durbin watson test
def durbin_watson_test(df, resid=None):
cols, stat = [], []
out = durbin_watson(resid)
for col, val in zip(df.columns, out):
cols.append(col)
stat.append(round(val, 2))
dw_test = pd.DataFrame(stat, index=cols, columns=['Durbin Watson Test Statistic'])
return dw_test
#Perform durbin watson test
dw_results = durbin_watson_test(df=energyConsumption_df, resid=varma_model.resid)
display(dw_results)
| Durbin Watson Test Statistic | |
|---|---|
| Gas Consumption (tons) | 2.01 |
| Electricity Consumption (MWh) | 1.78 |
| Water Consumption (tons) | 2.00 |
Observations:
#Score Metrics
print(f'Energy Consumption Score Metrics:')
print(f'Train MAPE: {v_Train_MAPE}%')
print(f'Train RMSE: {v_Train_RMSE}')
print(f'Validation MAPE: {v_Validate_MAPE}%')
print(f'Validation RMSE: {v_validate_RMSE}')
print()
Energy Consumption Score Metrics: Train MAPE: 13.028% Train RMSE: 55.019 Validation MAPE: 11.149% Validation RMSE: 57.356
Observations:
scoreMetrics_values={}
#Values of p & q
p_values = np.arange(0, 3)
q_values = np.arange(0, 8)
for i in list(product(p_values,q_values)):
order=(i[0], 1, i[1])
model = ARIMA(order=order)
(gas_trainPredicted, gas, pred_full_Gas, arima_modelGas, gas_aic, gas_bic,
gas_train_MAPE, gas_train_RMSE, gas_validate_MAPE, gas_validate_RMSE) = walk_forward_validate(
"Gas Consumption (tons)", train_set, test_set, model, energyConsumption_df
)
#concatenate order as a string
order_string=f'({i[0]},1,{i[1]})'
#store the score metrics in the dictionaries
scoreMetrics_values[order_string] = {'AIC': round(gas_aic,3), 'BIC': round(gas_bic,3), 'Train MAPE': gas_train_MAPE,
'Train RMSE': gas_train_RMSE, 'Validate MAPE': gas_validate_MAPE,
'Validate RMSE': gas_validate_RMSE}
#Convert the dictionary to dataframe
scoreMetrics_Gas_df = pd.DataFrame.from_dict(scoreMetrics_values, orient='index').sort_values(
by=["BIC","Validate MAPE", "Train MAPE","Validate RMSE","Train RMSE"],
ascending=[True, True, True,True,True])
display(scoreMetrics_Gas_df)
| AIC | BIC | Train MAPE | Train RMSE | Validate MAPE | Validate RMSE | |
|---|---|---|---|---|---|---|
| (1,1,1) | 2116.039 | 2131.965 | 11.345 | 3.716 | 9.344 | 2.961 |
| (2,1,1) | 2116.763 | 2136.671 | 11.344 | 3.708 | 9.426 | 2.969 |
| (1,1,2) | 2116.935 | 2136.843 | 11.329 | 3.710 | 9.319 | 2.968 |
| (0,1,3) | 2118.446 | 2138.353 | 11.598 | 3.713 | 9.605 | 2.991 |
| (2,1,2) | 2118.041 | 2141.930 | 11.402 | 3.703 | 9.395 | 2.978 |
| (1,1,3) | 2118.526 | 2142.414 | 11.403 | 3.704 | 9.408 | 2.992 |
| (0,1,2) | 2127.328 | 2143.254 | 11.823 | 3.772 | 9.496 | 2.976 |
| (0,1,4) | 2119.495 | 2143.383 | 11.466 | 3.706 | 9.514 | 3.005 |
| (0,1,5) | 2119.048 | 2146.918 | 11.352 | 3.699 | 9.371 | 2.979 |
| (2,1,3) | 2120.028 | 2147.897 | 11.412 | 3.703 | 9.424 | 2.982 |
| (1,1,4) | 2121.440 | 2149.310 | 11.549 | 3.705 | 9.625 | 3.006 |
| (0,1,6) | 2120.609 | 2152.460 | 11.414 | 3.694 | 9.422 | 2.995 |
| (2,1,4) | 2122.012 | 2153.863 | 11.378 | 3.699 | 9.434 | 3.007 |
| (1,1,5) | 2122.919 | 2154.770 | 11.452 | 3.701 | 9.559 | 3.016 |
| (0,1,1) | 2143.306 | 2155.250 | 11.742 | 3.854 | 9.417 | 3.044 |
| (2,1,6) | 2117.259 | 2157.073 | 11.141 | 3.659 | 9.428 | 2.999 |
| (1,1,6) | 2121.374 | 2157.207 | 11.283 | 3.690 | 9.418 | 2.981 |
| (0,1,7) | 2122.605 | 2158.437 | 11.415 | 3.693 | 9.442 | 3.003 |
| (2,1,5) | 2123.838 | 2159.670 | 11.380 | 3.699 | 9.431 | 3.004 |
| (1,1,7) | 2122.061 | 2161.875 | 11.318 | 3.680 | 9.495 | 3.004 |
| (2,1,7) | 2119.095 | 2162.890 | 11.157 | 3.658 | 9.418 | 3.006 |
| (2,1,0) | 2157.816 | 2173.741 | 11.318 | 3.917 | 9.233 | 3.060 |
| (1,1,0) | 2166.866 | 2178.810 | 11.221 | 3.958 | 9.260 | 3.151 |
| (0,1,0) | 2205.936 | 2213.899 | 11.105 | 4.180 | 9.357 | 3.192 |
lowestGasBIC_order = scoreMetrics_Gas_df["BIC"].idxmin()
lowestGasBIC_value = scoreMetrics_Gas_df["BIC"].min()
print(f"Order with Lowest BIC value: {lowestGasBIC_order}")
print(f"Lowest BIC value: {lowestGasBIC_value}")
lowestGasBIC_cols = scoreMetrics_Gas_df.loc[lowestGasBIC_order]
display(lowestGasBIC_cols)
Order with Lowest BIC value: (1,1,1) Lowest BIC value: 2131.965
AIC 2116.039 BIC 2131.965 Train MAPE 11.345 Train RMSE 3.716 Validate MAPE 9.344 Validate RMSE 2.961 Name: (1,1,1), dtype: float64
Observations:
Gas_ARIMA_model = ARIMA(order=(1,1,1)).fit(energyConsumption_df["Gas Consumption (tons)"])
Gas_ARIMA_model.plot_diagnostics(figsize=(14,10))
plt.show()
Observations:
Gas_ARIMA_model.summary()
| Dep. Variable: | y | No. Observations: | 397 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 1) | Log Likelihood | -1054.019 |
| Date: | Thu, 10 Aug 2023 | AIC | 2116.039 |
| Time: | 08:13:24 | BIC | 2131.965 |
| Sample: | 01-01-1990 | HQIC | 2122.348 |
| - 01-01-2023 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 0.0133 | 0.018 | 0.728 | 0.467 | -0.023 | 0.049 |
| ar.L1 | 0.4336 | 0.037 | 11.732 | 0.000 | 0.361 | 0.506 |
| ma.L1 | -0.9013 | 0.032 | -28.510 | 0.000 | -0.963 | -0.839 |
| sigma2 | 11.9797 | 0.429 | 27.933 | 0.000 | 11.139 | 12.820 |
| Ljung-Box (L1) (Q): | 0.22 | Jarque-Bera (JB): | 2255.67 |
|---|---|---|---|
| Prob(Q): | 0.64 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 2.21 | Skew: | 0.57 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 14.64 |
Observations:
#Gas Consumption Predictions for train and test data
model = ARIMA(order=(1,1,1))
(gas_trainPredicted, gas, pred_full_Gas, arima_modelGas, gas_aic, gas_bic,
gas_train_MAPE, gas_train_RMSE, gas_validate_MAPE, gas_validate_RMSE) = walk_forward_validate(
"Gas Consumption (tons)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Gas.head())
DATE 1990-01-01 -0.000792 1990-02-01 17.999123 1990-03-01 16.423710 1990-04-01 17.258491 1990-05-01 18.252177 dtype: float64
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))
#Plot training, testing & predicted values on the graph
train_set["Gas Consumption (tons)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Gas Consumption (tons)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Gas.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA for Gas Consumption (tons)", fontsize=16)
ax[0].set_yticks(np.arange(0,51,10))
ax[0].grid(True)
#Focus on the predicted values and the test values
test_set["Gas Consumption (tons)"].plot(ax=ax[1], label="testing", color='green')
gas.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].set_yticks(np.arange(0,51,10))
ax[1].grid(True)
plt.tight_layout()
plt.show()
Observations:
scoreMetrics_values={}
#Values of p & q
p_values = np.arange(0, 4)
q_values = np.arange(0, 5)
for i in list(product(p_values,q_values)):
order=(i[0], 1, i[1])
model = ARIMA(order=order)
(water_trainPredicted, water, pred_full_Water, arima_modelWater, water_aic, water_bic,
water_train_MAPE, water_train_RMSE, water_validate_MAPE, water_validate_RMSE) = walk_forward_validate(
"Water Consumption (tons)", train_set, test_set, model, energyConsumption_df
)
#concatenate order as a string
order_string=f'({i[0]},1,{i[1]})'
#store the score metrics in the dictionaries
scoreMetrics_values[order_string] = {'AIC': round(water_aic,3), 'BIC': round(water_bic,3),
'Train MAPE': water_train_MAPE, 'Train RMSE': water_train_RMSE,
'Validate MAPE': water_validate_MAPE, 'Validate RMSE': water_validate_RMSE}
#Convert the dictionary to dataframe
scoreMetrics_Water_df = pd.DataFrame.from_dict(scoreMetrics_values, orient='index').sort_values(
by=["BIC","Validate MAPE", "Train MAPE","Validate RMSE","Train RMSE"],
ascending=[True, True, True,True,True])
display(scoreMetrics_Water_df)
| AIC | BIC | Train MAPE | Train RMSE | Validate MAPE | Validate RMSE | |
|---|---|---|---|---|---|---|
| (1,1,1) | 4790.987 | 4806.913 | 22.076 | 108.206 | 17.147 | 91.563 |
| (1,1,2) | 4789.982 | 4809.889 | 22.176 | 108.025 | 16.839 | 90.383 |
| (2,1,1) | 4790.225 | 4810.132 | 22.157 | 108.043 | 16.850 | 90.430 |
| (2,1,2) | 4791.924 | 4815.813 | 22.174 | 107.988 | 16.890 | 90.617 |
| (1,1,3) | 4791.933 | 4815.822 | 22.185 | 108.001 | 16.879 | 90.562 |
| (3,1,1) | 4792.014 | 4815.903 | 22.172 | 108.025 | 16.875 | 90.449 |
| (0,1,3) | 4796.227 | 4816.134 | 22.110 | 108.422 | 17.312 | 93.050 |
| (0,1,2) | 4802.317 | 4818.243 | 22.107 | 109.198 | 17.670 | 95.287 |
| (0,1,4) | 4794.376 | 4818.264 | 22.071 | 108.164 | 17.013 | 91.451 |
| (2,1,3) | 4791.966 | 4819.836 | 22.175 | 107.806 | 16.771 | 90.067 |
| (1,1,4) | 4793.902 | 4821.772 | 22.123 | 108.245 | 17.087 | 91.325 |
| (3,1,2) | 4793.927 | 4821.797 | 22.179 | 107.987 | 16.867 | 90.601 |
| (2,1,4) | 4793.793 | 4825.644 | 22.225 | 107.855 | 16.772 | 90.078 |
| (0,1,1) | 4814.100 | 4826.044 | 22.588 | 111.539 | 17.005 | 94.140 |
| (3,1,3) | 4795.863 | 4827.714 | 21.940 | 107.465 | 17.229 | 92.607 |
| (3,1,4) | 4795.948 | 4831.781 | 21.751 | 107.564 | 16.918 | 90.987 |
| (3,1,0) | 4821.964 | 4841.871 | 22.517 | 112.380 | 16.675 | 93.127 |
| (2,1,0) | 4829.537 | 4845.462 | 22.433 | 113.995 | 16.203 | 92.509 |
| (1,1,0) | 4843.533 | 4855.478 | 22.763 | 116.527 | 16.235 | 92.506 |
| (0,1,0) | 4888.177 | 4896.140 | 22.818 | 123.119 | 16.501 | 97.938 |
lowestWaterBIC_order = scoreMetrics_Water_df["BIC"].idxmin()
lowestWaterBIC_value = scoreMetrics_Water_df["BIC"].min()
print(f"Order with Lowest BIC value: {lowestWaterBIC_order}")
print(f"Lowest BIC value: {lowestWaterBIC_value}")
lowestWaterBIC_cols = scoreMetrics_Water_df.loc[lowestWaterBIC_order]
display(lowestWaterBIC_cols)
Order with Lowest BIC value: (1,1,1) Lowest BIC value: 4806.913
AIC 4790.987 BIC 4806.913 Train MAPE 22.076 Train RMSE 108.206 Validate MAPE 17.147 Validate RMSE 91.563 Name: (1,1,1), dtype: float64
Observations:
#Diagnostic plot
Water_ARIMA_model = ARIMA(order=(1,1,1)).fit(energyConsumption_df["Water Consumption (tons)"])
Water_ARIMA_model.plot_diagnostics(figsize=(14,10))
plt.show()
Observations:
Water_ARIMA_model.summary()
| Dep. Variable: | y | No. Observations: | 397 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 1) | Log Likelihood | -2391.494 |
| Date: | Thu, 10 Aug 2023 | AIC | 4790.987 |
| Time: | 08:23:35 | BIC | 4806.913 |
| Sample: | 01-01-1990 | HQIC | 4797.296 |
| - 01-01-2023 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | -0.0600 | 0.395 | -0.152 | 0.879 | -0.834 | 0.714 |
| ar.L1 | 0.4397 | 0.045 | 9.737 | 0.000 | 0.351 | 0.528 |
| ma.L1 | -0.9294 | 0.024 | -39.224 | 0.000 | -0.976 | -0.883 |
| sigma2 | 1.028e+04 | 523.320 | 19.635 | 0.000 | 9249.795 | 1.13e+04 |
| Ljung-Box (L1) (Q): | 0.51 | Jarque-Bera (JB): | 128.13 |
|---|---|---|---|
| Prob(Q): | 0.48 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 1.50 | Skew: | -0.30 |
| Prob(H) (two-sided): | 0.02 | Kurtosis: | 5.72 |
Observations:
#Water Consumption Predictions for train and test data
model = ARIMA(order=(1,1,1))
(water_trainPredicted, water, pred_full_Water, arima_modelWater, water_aic, water_bic,
water_train_MAPE, water_train_RMSE, water_validate_MAPE, water_validate_RMSE) = walk_forward_validate(
"Water Consumption (tons)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Water.head())
DATE 1990-01-01 0.207982 1990-02-01 546.711694 1990-03-01 611.921085 1990-04-01 536.728785 1990-05-01 534.288835 dtype: float64
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))
#Plot training, testing & predicted values on the graph
train_set["Water Consumption (tons)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Water Consumption (tons)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Water.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"ARIMA for Water Consumption (tons)", fontsize=16)
ax[0].grid(True)
test_set["Water Consumption (tons)"].plot(ax=ax[1], label="testing", color='green')
water.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()
Observations:
scoreMetrics_values={}
#Values of p, q, P & Q
p_values = np.arange(0, 3)
q_values = np.arange(0, 4)
P_values = np.arange(0, 3)
Q_values = np.arange(0, 3)
for i in list(product(p_values, q_values, P_values, Q_values)):
order = (i[0], 1, i[1])
seasonal_order = (i[2], 1, i[3], 12)
model = ARIMA(order=order, seasonal_order=seasonal_order)
(electricity_trainPredicted, electricity, pred_full_Electricity, arima_modelElectricity, elec_aic, elec_bic,
elec_train_MAPE, elec_train_RMSE, elec_validate_MAPE, elec_validate_RMSE) = walk_forward_validate(
"Electricity Consumption (MWh)", train_set, test_set, model, energyConsumption_df
)
#concatenate order as a string
order_string=f'({i[0]},1,{i[1]}), ({i[2]},1,{i[3]},12)'
#store the score metrics in the dictionaries
scoreMetrics_values[order_string] = {'AIC': round(elec_aic,3), 'BIC': round(elec_bic,3),
'Train MAPE': elec_train_MAPE, 'Train RMSE': elec_train_RMSE,
'Validate MAPE': elec_validate_MAPE, 'Validate RMSE': elec_validate_RMSE}
#Convert the dictionary to dataframe
scoreMetrics_Electricity_df = pd.DataFrame.from_dict(scoreMetrics_values, orient='index').sort_values(
by=["BIC","Validate MAPE", "Train MAPE","Validate RMSE","Train RMSE"],
ascending=[True, True, True,True,True])
display(scoreMetrics_Electricity_df.head(15))
| AIC | BIC | Train MAPE | Train RMSE | Validate MAPE | Validate RMSE | |
|---|---|---|---|---|---|---|
| (1,1,1), (2,1,1,12) | 3540.218 | 3567.872 | 2.486 | 50.117 | 2.303 | 30.474 |
| (1,1,1), (0,1,1,12) | 3550.888 | 3570.641 | 2.503 | 50.214 | 2.394 | 31.265 |
| (2,1,1), (2,1,1,12) | 3540.350 | 3571.955 | 2.482 | 50.101 | 2.280 | 30.339 |
| (1,1,2), (2,1,1,12) | 3541.214 | 3572.819 | 2.485 | 50.092 | 2.290 | 30.391 |
| (0,1,3), (2,1,1,12) | 3543.050 | 3574.655 | 2.507 | 50.128 | 2.281 | 30.494 |
| (1,1,2), (0,1,1,12) | 3551.175 | 3574.879 | 2.498 | 50.185 | 2.413 | 31.561 |
| (2,1,1), (0,1,1,12) | 3551.651 | 3575.355 | 2.503 | 50.199 | 2.404 | 31.449 |
| (1,1,1), (0,1,2,12) | 3551.755 | 3575.459 | 2.507 | 50.211 | 2.375 | 31.396 |
| (1,1,1), (1,1,1,12) | 3552.241 | 3575.945 | 2.506 | 50.212 | 2.382 | 31.322 |
| (0,1,2), (2,1,1,12) | 3548.394 | 3576.048 | 2.538 | 50.197 | 2.318 | 30.942 |
| (1,1,2), (2,1,2,12) | 3540.723 | 3576.279 | 2.471 | 50.067 | 2.324 | 30.547 |
| (1,1,1), (1,1,2,12) | 3550.219 | 3577.873 | 2.506 | 50.186 | 2.341 | 31.043 |
| (2,1,2), (2,1,1,12) | 3542.989 | 3578.545 | 2.476 | 50.074 | 2.294 | 30.325 |
| (2,1,1), (2,1,2,12) | 3543.524 | 3579.079 | 2.470 | 50.071 | 2.322 | 30.700 |
| (1,1,2), (0,1,2,12) | 3552.451 | 3580.106 | 2.505 | 50.185 | 2.399 | 31.488 |
lowestElectricityBIC_order = scoreMetrics_Electricity_df["BIC"].idxmin()
lowestElectricityBIC_value = scoreMetrics_Electricity_df["BIC"].min()
print(f"Order with Lowest BIC value: {lowestElectricityBIC_order}")
print(f"Lowest BIC value: {lowestElectricityBIC_value}")
lowestElectricityBIC_cols = scoreMetrics_Electricity_df.loc[lowestElectricityBIC_order]
display(lowestElectricityBIC_cols)
Order with Lowest BIC value: (1,1,1), (2,1,1,12) Lowest BIC value: 3567.872
AIC 3540.218 BIC 3567.872 Train MAPE 2.486 Train RMSE 50.117 Validate MAPE 2.303 Validate RMSE 30.474 Name: (1,1,1), (2,1,1,12), dtype: float64
Observations:
scoreMetrics_Electricity_df.sort_values(
by=["Validate MAPE", "BIC" ,"Train MAPE","Validate RMSE","Train RMSE"],
ascending=[True, True, True,True,True])
| AIC | BIC | Train MAPE | Train RMSE | Validate MAPE | Validate RMSE | |
|---|---|---|---|---|---|---|
| (2,1,1), (2,1,1,12) | 3540.350 | 3571.955 | 2.482 | 50.101 | 2.280 | 30.339 |
| (0,1,3), (2,1,1,12) | 3543.050 | 3574.655 | 2.507 | 50.128 | 2.281 | 30.494 |
| (2,1,2), (2,1,2,12) | 3541.906 | 3581.413 | 2.473 | 50.075 | 2.281 | 30.264 |
| (1,1,2), (2,1,1,12) | 3541.214 | 3572.819 | 2.485 | 50.092 | 2.290 | 30.391 |
| (2,1,2), (2,1,1,12) | 3542.989 | 3578.545 | 2.476 | 50.074 | 2.294 | 30.325 |
| ... | ... | ... | ... | ... | ... | ... |
| (0,1,1), (0,1,0,12) | 3742.753 | 3754.605 | 3.042 | 53.297 | 3.001 | 40.805 |
| (1,1,0), (0,1,0,12) | 3764.756 | 3776.608 | 3.124 | 53.772 | 3.016 | 41.643 |
| (0,1,0), (0,1,0,12) | 3780.555 | 3788.457 | 3.157 | 54.117 | 3.021 | 42.591 |
| (2,1,0), (1,1,0,12) | 3678.675 | 3698.429 | 2.827 | 51.804 | 3.033 | 38.875 |
| (0,1,1), (1,1,0,12) | 3693.682 | 3709.485 | 2.832 | 52.044 | 3.104 | 39.976 |
108 rows × 6 columns
lowestElectricityMAPE_order = scoreMetrics_Electricity_df["Validate MAPE"].idxmin()
lowestElectricityMAPE_value = scoreMetrics_Electricity_df["Validate MAPE"].min()
print(f"Order with Lowest Validate MAPE value: {lowestElectricityMAPE_order}")
print(f"Lowest Validate MAPE value: {lowestElectricityMAPE_value}")
lowestElectricityMAPE_cols = scoreMetrics_Electricity_df.loc[lowestElectricityMAPE_order]
display(lowestElectricityMAPE_cols)
Order with Lowest Validate MAPE value: (2,1,1), (2,1,1,12) Lowest Validate MAPE value: 2.28
AIC 3540.350 BIC 3571.955 Train MAPE 2.482 Train RMSE 50.101 Validate MAPE 2.280 Validate RMSE 30.339 Name: (2,1,1), (2,1,1,12), dtype: float64
Observations:
#Electricity Consumption Predictions for train and test data
model = ARIMA(order=(2,1,1), seasonal_order=(2,1,1,12))
(electricity_trainPredicted, electricity, pred_full_Elec, arima_modelElec, elec_aic, elec_bic,
elec_train_MAPE, elec_train_RMSE, elec_validate_MAPE, elec_validate_RMSE) = walk_forward_validate(
"Electricity Consumption (MWh)", train_set, test_set, model, energyConsumption_df
)
display(pred_full_Elec.head())
DATE 1990-01-01 -0.054315 1990-02-01 724.986589 1990-03-01 706.595080 1990-04-01 624.434714 1990-05-01 574.655182 dtype: float64
#Visualise train and test time series in a time series plot
fig, ax = plt.subplots(2,1,figsize=(13,6))
#Plot training, testing & predicted values on the graph
train_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="training", color='dodgerblue')
test_set["Electricity Consumption (MWh)"].plot(ax=ax[0], label="testing", color='green')
pred_full_Elec.plot(ax=ax[0], label='prediction', color="orange")
ax[0].legend()
ax[0].set_title(f"SARIMA for Electricity Consumption (MWh)", fontsize=16)
ax[0].grid(True)
test_set["Electricity Consumption (MWh)"].plot(ax=ax[1], label="testing", color='green')
electricity.plot(ax=ax[1], label='prediction', color="orange")
ax[1].legend()
ax[1].set_title(f"Closer look into Test and Predicted Values", fontsize=16)
ax[1].grid(True)
plt.tight_layout()
plt.show()
Observations:
#print summary for Electricity Consumption
arima_modelElec.summary()
| Dep. Variable: | y | No. Observations: | 397 |
|---|---|---|---|
| Model: | SARIMAX(2, 1, 1)x(2, 1, 1, 12) | Log Likelihood | -1762.175 |
| Date: | Fri, 11 Aug 2023 | AIC | 3540.350 |
| Time: | 05:46:57 | BIC | 3571.955 |
| Sample: | 0 | HQIC | 3552.886 |
| - 397 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | -0.0413 | 0.015 | -2.840 | 0.005 | -0.070 | -0.013 |
| ar.L1 | 0.5426 | 0.047 | 11.472 | 0.000 | 0.450 | 0.635 |
| ar.L2 | -0.0619 | 0.050 | -1.244 | 0.213 | -0.160 | 0.036 |
| ma.L1 | -0.9712 | 0.016 | -62.000 | 0.000 | -1.002 | -0.941 |
| ar.S.L12 | -0.0511 | 0.061 | -0.843 | 0.399 | -0.170 | 0.068 |
| ar.S.L24 | -0.2547 | 0.064 | -4.003 | 0.000 | -0.379 | -0.130 |
| ma.S.L12 | -0.6733 | 0.059 | -11.358 | 0.000 | -0.789 | -0.557 |
| sigma2 | 544.1684 | 32.582 | 16.701 | 0.000 | 480.308 | 608.029 |
| Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 23.47 |
|---|---|---|---|
| Prob(Q): | 0.98 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 2.90 | Skew: | 0.05 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 4.21 |
Observations:
#Plot diagnostic plot
arima_modelElec.plot_diagnostics(figsize=(14,10))
plt.show()
Observations:
scoreMetrics_values={}
#Values of p & q
p_values = np.arange(0, 5)
q_values = np.arange(0, 5)
columns = ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)']
for i in list(product(p_values, q_values)):
order = (i[0], i[1])
if order!=(0, 0):
varma_model = VARMAX(endog=train_set[columns], order=order).fit(disp=False)
(trainPredicted, testPredicted, pred_full, varma_model, aic, bic,
v_Train_MAPE, v_Train_RMSE, v_Validate_MAPE, v_validate_RMSE) = walk_forward_validate_varma(
columns, train_set, test_set, varma_model, order
)
#concatenate order as a string
order_string=f'({i[0]},{i[1]})'
#store the score metrics in the dictionaries
scoreMetrics_values[order_string] = {'AIC': round(aic,3), 'BIC': round(bic,3),
'Train MAPE': v_Train_MAPE, 'Train RMSE': v_Train_RMSE,
'Validate MAPE': v_Validate_MAPE, 'Validate RMSE': v_validate_RMSE}
#Convert the dictionary to dataframe
scoreMetrics_varma_AllCols=pd.DataFrame.from_dict(scoreMetrics_values, orient='index').sort_values(
by=["BIC","Validate MAPE", "Train MAPE","Validate RMSE","Train RMSE"],
ascending=[True, True, True,True,True])
display(scoreMetrics_varma_AllCols)
| AIC | BIC | Train MAPE | Train RMSE | Validate MAPE | Validate RMSE | |
|---|---|---|---|---|---|---|
| (4,1) | 10965.274 | 11180.270 | 12.270 | 48.256 | 10.413 | 48.440 |
| (4,0) | 11003.509 | 11182.673 | 12.294 | 49.108 | 10.400 | 49.626 |
| (3,1) | 11049.429 | 11228.593 | 12.476 | 49.603 | 10.356 | 49.713 |
| (3,2) | 11016.253 | 11231.249 | 12.348 | 49.110 | 10.242 | 48.097 |
| (4,3) | 10946.554 | 11233.216 | 12.074 | 46.907 | 10.283 | 47.382 |
| (4,2) | 10992.567 | 11243.396 | 12.237 | 48.373 | 10.337 | 48.372 |
| (4,4) | 10957.499 | 11279.994 | 12.040 | 46.528 | 10.533 | 48.545 |
| (3,0) | 11146.452 | 11289.783 | 12.605 | 51.905 | 10.724 | 53.874 |
| (3,3) | 11052.662 | 11303.491 | 12.249 | 48.415 | 10.704 | 50.333 |
| (2,1) | 11255.889 | 11399.219 | 13.007 | 54.374 | 11.160 | 55.962 |
| (1,1) | 11297.120 | 11404.618 | 13.028 | 55.019 | 11.149 | 57.356 |
| (2,0) | 11310.818 | 11418.317 | 13.183 | 56.085 | 11.306 | 59.190 |
| (2,2) | 11267.820 | 11446.983 | 13.024 | 53.845 | 11.942 | 58.709 |
| (3,4) | 11183.450 | 11470.112 | 12.408 | 49.170 | 10.630 | 51.614 |
| (1,3) | 11300.195 | 11479.359 | 13.161 | 54.000 | 10.864 | 55.258 |
| (2,3) | 11272.120 | 11487.116 | 12.888 | 53.503 | 11.800 | 57.709 |
| (1,0) | 11420.343 | 11492.009 | 13.773 | 60.142 | 11.173 | 61.756 |
| (0,3) | 11366.295 | 11509.626 | 15.563 | 64.783 | 11.219 | 56.821 |
| (1,4) | 11296.918 | 11511.915 | 13.030 | 53.255 | 10.932 | 55.006 |
| (2,4) | 11268.282 | 11519.111 | 12.930 | 53.164 | 11.748 | 57.280 |
| (1,2) | 11384.797 | 11528.128 | 13.018 | 55.792 | 11.041 | 59.102 |
| (0,4) | 11374.230 | 11553.394 | 15.791 | 66.254 | 10.994 | 58.015 |
| (0,2) | 11560.365 | 11667.863 | 16.094 | 67.997 | 11.479 | 62.410 |
| (0,1) | 11783.506 | 11855.172 | 16.706 | 69.487 | 12.811 | 69.277 |
lowestEnergyBIC_order = scoreMetrics_varma_AllCols["BIC"].idxmin()
lowestEnergyBIC_value = scoreMetrics_varma_AllCols["BIC"].min()
print(f"Order with Lowest BIC value: {lowestEnergyBIC_order}")
print(f"Lowest BIC value: {lowestEnergyBIC_value}")
lowestEnergyBIC_cols = scoreMetrics_varma_AllCols.loc[lowestEnergyBIC_order]
display(lowestEnergyBIC_cols)
Order with Lowest BIC value: (4,1) Lowest BIC value: 11180.27
AIC 10965.274 BIC 11180.270 Train MAPE 12.270 Train RMSE 48.256 Validate MAPE 10.413 Validate RMSE 48.440 Name: (4,1), dtype: float64
Observations:
#VARMA Predictions for train and test data
columns = ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)']
varma_model = VARMAX(endog=train_set[columns], order=(4,1)).fit(disp=False)
(trainPredicted, testPredicted, pred_full, varma_model, aic, bic,
v_Train_MAPE, v_Train_RMSE, v_Validate_MAPE, v_validate_RMSE) = walk_forward_validate_varma(
columns, train_set, test_set, varma_model, (4,1)
)
display(pred_full.head())
| Gas Consumption (tons) | Electricity Consumption (MWh) | Water Consumption (tons) | |
|---|---|---|---|
| DATE | |||
| 1990-01-01 | 23.111158 | 849.509284 | 496.233326 |
| 1990-02-01 | 19.728519 | 754.302206 | 518.848577 |
| 1990-03-01 | 18.361752 | 751.997627 | 566.820054 |
| 1990-04-01 | 18.031029 | 624.875206 | 524.352203 |
| 1990-05-01 | 19.291235 | 614.024792 | 518.415145 |
fig,ax = plt.subplots(3, 1)
fig.suptitle("VARMA Actual VS Predicted", fontsize=14, fontweight="bold")
#Plot actual & predicted values on the graph
index=0
for column in energyConsumption_df.columns:
energyConsumption_df[column].plot(ax=ax[index], label="actual", color='dodgerblue')
pred_full[column].plot(ax=ax[index], label='predicted', linestyle='--', color='orange')
ax[index].legend()
ax[index].set_title(f'{column}')
index+=1
plt.tight_layout()
plt.show()
Observations:
#Impulse-Response Function
varma_model = VARMAX(endog=train_set[columns], order=(4,1)).fit(disp=False)
gas_irfs = varma_model.impulse_responses(24, impulse=0, orthogonalized=True)
electricity_irfs = varma_model.impulse_responses(24, impulse=1, orthogonalized=True)
water_irfs = varma_model.impulse_responses(24, impulse=2, orthogonalized=True)
#Plot impulse reaction graph for better visualisation
fig, ax = plt.subplots(3, 1, figsize=(15,9))
index=0
irfs=[gas_irfs, electricity_irfs, water_irfs]
for col in columns:
ax[index].plot(range(0,25), irfs[index], marker='o', label=["Gas","Electricity","Water"])
ax[index].set_xlabel("Time Steps")
ax[index].set_ylabel("Impulse-Response Value")
ax[index].set_title(f"Impulse-Response Function - {col}")
ax[index].legend()
ax[index].grid(True)
index+=1
plt.tight_layout()
plt.show()
Observations:
#Perform durbin watson test
dw_results = durbin_watson_test(df=energyConsumption_df, resid=varma_model.resid)
display(dw_results)
| Durbin Watson Test Statistic | |
|---|---|
| Gas Consumption (tons) | 1.99 |
| Electricity Consumption (MWh) | 1.94 |
| Water Consumption (tons) | 1.98 |
Observations:
#Diagnostics plot for Gas Consumption
fig=varma_model.plot_diagnostics(variable=0)
plt.gcf().suptitle('Gas Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(250, 557)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()
Observations for Gas Consumption:
#Diagnostics plot for Electricity Consumption
fig=varma_model.plot_diagnostics(variable=1)
plt.gcf().suptitle('Electricity Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(240, 555)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()
Observations for Electricity Consumption:
#Diagnostics plot for Water Consumption
fig=varma_model.plot_diagnostics(variable=2)
plt.gcf().suptitle('Water Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(240, 555)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()
Observations for Water Consumption:
#Try out p=4, q=1
columns = ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)']
varma_model = varma_model = VARMAX(endog=train_set[columns], order=(4,1)).fit(disp=False)
#Summary
varma_model.summary()
| Dep. Variable: | ['Gas Consumption (tons)', 'Electricity Consumption (MWh)', 'Water Consumption (tons)'] | No. Observations: | 317 |
|---|---|---|---|
| Model: | VARMA(4,1) | Log Likelihood | -4342.989 |
| + intercept | AIC | 8793.977 | |
| Date: | Fri, 11 Aug 2023 | BIC | 8996.958 |
| Time: | 16:21:02 | HQIC | 8875.058 |
| Sample: | 01-01-1990 | ||
| - 05-01-2016 | |||
| Covariance Type: | opg |
| Ljung-Box (L1) (Q): | 0.00, 0.10, 0.00 | Jarque-Bera (JB): | 2048.14, 9.22, 71.69 |
|---|---|---|---|
| Prob(Q): | 0.98, 0.75, 0.99 | Prob(JB): | 0.00, 0.01, 0.00 |
| Heteroskedasticity (H): | 2.16, 1.84, 1.32 | Skew: | 0.80, 0.35, -0.40 |
| Prob(H) (two-sided): | 0.00, 0.00, 0.15 | Kurtosis: | 15.35, 3.47, 5.19 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 7.3403 | 14.838 | 0.495 | 0.621 | -21.742 | 36.423 |
| L1.Gas Consumption (tons) | 0.5092 | 1.425 | 0.357 | 0.721 | -2.284 | 3.303 |
| L1.Electricity Consumption (MWh) | 0.0088 | 0.013 | 0.669 | 0.504 | -0.017 | 0.035 |
| L1.Water Consumption (tons) | 0.0008 | 0.050 | 0.016 | 0.987 | -0.097 | 0.098 |
| L2.Gas Consumption (tons) | 0.1539 | 0.974 | 0.158 | 0.874 | -1.755 | 2.063 |
| L2.Electricity Consumption (MWh) | -0.0147 | 0.020 | -0.726 | 0.468 | -0.054 | 0.025 |
| L2.Water Consumption (tons) | -4.867e-05 | 0.029 | -0.002 | 0.999 | -0.057 | 0.057 |
| L3.Gas Consumption (tons) | -0.0361 | 0.176 | -0.205 | 0.838 | -0.382 | 0.310 |
| L3.Electricity Consumption (MWh) | 0.0125 | 0.019 | 0.661 | 0.509 | -0.025 | 0.050 |
| L3.Water Consumption (tons) | -0.0032 | 0.004 | -0.791 | 0.429 | -0.011 | 0.005 |
| L4.Gas Consumption (tons) | 0.0809 | 0.123 | 0.659 | 0.510 | -0.160 | 0.321 |
| L4.Electricity Consumption (MWh) | -0.0050 | 0.008 | -0.612 | 0.541 | -0.021 | 0.011 |
| L4.Water Consumption (tons) | -0.0016 | 0.004 | -0.350 | 0.726 | -0.010 | 0.007 |
| L1.e(Gas Consumption (tons)) | 0.0218 | 1.433 | 0.015 | 0.988 | -2.786 | 2.830 |
| L1.e(Electricity Consumption (MWh)) | -0.0057 | 0.015 | -0.370 | 0.712 | -0.036 | 0.025 |
| L1.e(Water Consumption (tons)) | 0.0013 | 0.050 | 0.026 | 0.979 | -0.097 | 0.099 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 21.7296 | 159.778 | 0.136 | 0.892 | -291.429 | 334.888 |
| L1.Gas Consumption (tons) | -0.5529 | 16.260 | -0.034 | 0.973 | -32.422 | 31.316 |
| L1.Electricity Consumption (MWh) | 1.4951 | 0.151 | 9.888 | 0.000 | 1.199 | 1.791 |
| L1.Water Consumption (tons) | 0.0056 | 0.534 | 0.011 | 0.992 | -1.042 | 1.053 |
| L2.Gas Consumption (tons) | 0.5905 | 11.023 | 0.054 | 0.957 | -21.014 | 22.195 |
| L2.Electricity Consumption (MWh) | -0.9741 | 0.253 | -3.856 | 0.000 | -1.469 | -0.479 |
| L2.Water Consumption (tons) | 0.0059 | 0.319 | 0.018 | 0.985 | -0.619 | 0.631 |
| L3.Gas Consumption (tons) | -0.1692 | 1.901 | -0.089 | 0.929 | -3.895 | 3.556 |
| L3.Electricity Consumption (MWh) | 0.0475 | 0.235 | 0.202 | 0.840 | -0.413 | 0.508 |
| L3.Water Consumption (tons) | -0.0100 | 0.057 | -0.176 | 0.860 | -0.122 | 0.102 |
| L4.Gas Consumption (tons) | 0.0841 | 1.388 | 0.061 | 0.952 | -2.636 | 2.804 |
| L4.Electricity Consumption (MWh) | 0.4101 | 0.096 | 4.290 | 0.000 | 0.223 | 0.597 |
| L4.Water Consumption (tons) | -0.0065 | 0.062 | -0.105 | 0.916 | -0.128 | 0.115 |
| L1.e(Gas Consumption (tons)) | 0.3875 | 16.352 | 0.024 | 0.981 | -31.661 | 32.436 |
| L1.e(Electricity Consumption (MWh)) | -0.5081 | 0.159 | -3.192 | 0.001 | -0.820 | -0.196 |
| L1.e(Water Consumption (tons)) | -0.0293 | 0.529 | -0.055 | 0.956 | -1.065 | 1.007 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 82.7297 | 400.688 | 0.206 | 0.836 | -702.605 | 868.064 |
| L1.Gas Consumption (tons) | 4.4434 | 39.346 | 0.113 | 0.910 | -72.673 | 81.560 |
| L1.Electricity Consumption (MWh) | -0.0783 | 0.319 | -0.245 | 0.806 | -0.704 | 0.547 |
| L1.Water Consumption (tons) | 0.5642 | 1.303 | 0.433 | 0.665 | -1.989 | 3.118 |
| L2.Gas Consumption (tons) | -3.2983 | 26.945 | -0.122 | 0.903 | -56.110 | 49.513 |
| L2.Electricity Consumption (MWh) | 0.1979 | 0.490 | 0.404 | 0.686 | -0.762 | 1.158 |
| L2.Water Consumption (tons) | 0.0278 | 0.769 | 0.036 | 0.971 | -1.479 | 1.535 |
| L3.Gas Consumption (tons) | -0.8847 | 4.561 | -0.194 | 0.846 | -9.823 | 8.054 |
| L3.Electricity Consumption (MWh) | -0.0959 | 0.440 | -0.218 | 0.828 | -0.959 | 0.767 |
| L3.Water Consumption (tons) | 0.0506 | 0.107 | 0.471 | 0.638 | -0.160 | 0.261 |
| L4.Gas Consumption (tons) | -1.8767 | 3.625 | -0.518 | 0.605 | -8.982 | 5.229 |
| L4.Electricity Consumption (MWh) | 0.0890 | 0.213 | 0.418 | 0.676 | -0.329 | 0.507 |
| L4.Water Consumption (tons) | 0.0731 | 0.112 | 0.655 | 0.512 | -0.146 | 0.292 |
| L1.e(Gas Consumption (tons)) | 0.3860 | 39.455 | 0.010 | 0.992 | -76.945 | 77.717 |
| L1.e(Electricity Consumption (MWh)) | -0.1048 | 0.368 | -0.285 | 0.776 | -0.827 | 0.617 |
| L1.e(Water Consumption (tons)) | -0.0495 | 1.311 | -0.038 | 0.970 | -2.619 | 2.520 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| sqrt.var.Gas Consumption (tons) | 3.5500 | 0.127 | 27.964 | 0.000 | 3.301 | 3.799 |
| sqrt.cov.Gas Consumption (tons).Electricity Consumption (MWh) | 2.2208 | 3.506 | 0.633 | 0.526 | -4.651 | 9.093 |
| sqrt.var.Electricity Consumption (MWh) | 41.8941 | 2.223 | 18.846 | 0.000 | 37.537 | 46.251 |
| sqrt.cov.Gas Consumption (tons).Water Consumption (tons) | -46.5015 | 5.396 | -8.618 | 0.000 | -57.078 | -35.925 |
| sqrt.cov.Electricity Consumption (MWh).Water Consumption (tons) | 8.2834 | 7.392 | 1.121 | 0.262 | -6.204 | 22.771 |
| sqrt.var.Water Consumption (tons) | 92.6488 | 4.063 | 22.802 | 0.000 | 84.685 | 100.612 |
Observations:
scoreMetrics_values={}
#Values of p & q
p_values = np.arange(0, 5)
q_values = np.arange(0, 5)
columns = ['Gas Consumption (tons)', 'Water Consumption (tons)']
for i in list(product(p_values, q_values)):
order = (i[0], i[1])
if order!=(0, 0):
varma_model = VARMAX(endog=train_set[columns], order=order).fit(disp=False)
(trainPredicted, testPredicted, pred_full, varma_model, aic, bic,
va_Train_MAPE, va_Train_RMSE, va_Validate_MAPE, va_validate_RMSE) = walk_forward_validate_varma(
columns, train_set, test_set, varma_model, order
)
#concatenate order as a string
order_string=f'({i[0]},{i[1]})'
#store the score metrics in the dictionaries
scoreMetrics_values[order_string] = {'AIC': round(aic,3), 'BIC': round(bic,3),
'Train MAPE': va_Train_MAPE, 'Train RMSE': va_Train_RMSE,
'Validate MAPE': va_Validate_MAPE, 'Validate RMSE': va_validate_RMSE}
#Convert the dictionary to dataframe
scoreMetrics_varma_GasWater=pd.DataFrame.from_dict(scoreMetrics_values, orient='index').sort_values(
by=["BIC","Validate MAPE", "Train MAPE","Validate RMSE","Train RMSE"],
ascending=[True, True, True,True,True])
display(scoreMetrics_varma_GasWater)
| AIC | BIC | Train MAPE | Train RMSE | Validate MAPE | Validate RMSE | |
|---|---|---|---|---|---|---|
| (1,1) | 6855.624 | 6907.383 | 16.783 | 53.829 | 12.845 | 45.666 |
| (2,0) | 6857.742 | 6909.500 | 16.817 | 54.298 | 12.710 | 45.282 |
| (1,0) | 6874.542 | 6910.374 | 17.203 | 55.770 | 12.596 | 45.357 |
| (2,1) | 6859.265 | 6926.949 | 16.828 | 53.734 | 12.828 | 46.545 |
| (3,0) | 6860.386 | 6928.070 | 16.725 | 53.806 | 12.880 | 46.076 |
| (1,2) | 6863.627 | 6931.311 | 16.742 | 53.791 | 12.808 | 46.115 |
| (4,0) | 6861.252 | 6944.861 | 16.620 | 53.434 | 12.930 | 46.470 |
| (2,2) | 6863.413 | 6947.022 | 16.739 | 53.331 | 13.028 | 46.579 |
| (3,1) | 6865.542 | 6949.151 | 16.733 | 53.552 | 12.819 | 46.485 |
| (1,3) | 6872.424 | 6956.034 | 16.844 | 54.029 | 12.777 | 46.091 |
| (3,2) | 6865.803 | 6965.338 | 16.598 | 53.092 | 13.066 | 47.259 |
| (2,3) | 6866.049 | 6965.584 | 16.654 | 53.140 | 13.069 | 47.070 |
| (4,1) | 6867.627 | 6967.162 | 16.656 | 53.170 | 13.037 | 46.824 |
| (1,4) | 6882.055 | 6981.590 | 17.093 | 54.257 | 12.828 | 45.622 |
| (3,3) | 6872.502 | 6987.963 | 16.620 | 52.921 | 13.322 | 47.517 |
| (4,2) | 6873.994 | 6989.455 | 16.573 | 53.005 | 13.087 | 46.716 |
| (0,4) | 6906.697 | 6990.306 | 18.961 | 57.230 | 12.905 | 45.286 |
| (2,4) | 6875.056 | 6990.517 | 16.750 | 53.219 | 13.235 | 47.196 |
| (0,3) | 6924.961 | 6992.645 | 18.779 | 57.246 | 13.354 | 45.689 |
| (0,2) | 6954.793 | 7006.551 | 19.101 | 58.304 | 13.359 | 46.082 |
| (4,3) | 6878.091 | 7009.478 | 16.521 | 52.774 | 13.217 | 47.405 |
| (3,4) | 6878.726 | 7010.113 | 16.612 | 52.900 | 13.310 | 47.785 |
| (4,4) | 6885.324 | 7032.636 | 16.426 | 52.675 | 13.310 | 47.759 |
| (0,1) | 7049.810 | 7085.643 | 20.823 | 61.470 | 14.862 | 48.663 |
lowestGasWaterBIC_order = scoreMetrics_varma_GasWater["BIC"].idxmin()
lowestGasWaterBIC_value = scoreMetrics_varma_GasWater["BIC"].min()
print(f"Order with Lowest BIC value: {lowestGasWaterBIC_order}")
print(f"Lowest BIC value: {lowestGasWaterBIC_value}")
lowestGasWaterBIC_cols = scoreMetrics_varma_GasWater.loc[lowestGasWaterBIC_order]
display(lowestGasWaterBIC_cols)
Order with Lowest BIC value: (1,1) Lowest BIC value: 6907.383
AIC 6855.624 BIC 6907.383 Train MAPE 16.783 Train RMSE 53.829 Validate MAPE 12.845 Validate RMSE 45.666 Name: (1,1), dtype: float64
Observations:
#VARMA Predictions for train and test data
columns = ['Gas Consumption (tons)', 'Water Consumption (tons)']
varma_model = VARMAX(endog=train_set[columns], order=(1,1)).fit(disp=False)
(trainPredicted, testPredicted, pred_full, varma_model, aic, bic,
va_Train_MAPE, va_Train_RMSE, va_Validate_MAPE, va_validate_RMSE) = walk_forward_validate_varma(
columns, train_set, test_set, varma_model, (1,1)
)
display(pred_full.head())
| Gas Consumption (tons) | Water Consumption (tons) | |
|---|---|---|
| DATE | ||
| 1990-01-01 | 23.160686 | 489.703408 |
| 1990-02-01 | 19.559033 | 528.456274 |
| 1990-03-01 | 18.258346 | 577.290008 |
| 1990-04-01 | 18.045345 | 535.884574 |
| 1990-05-01 | 19.383292 | 526.591122 |
fig,ax = plt.subplots(2, 1)
fig.suptitle("VARMA Actual VS Predicted", fontsize=14, fontweight="bold")
#Plot actual & predicted values on the graph
index=0
for column in columns:
energyConsumption_df[column].plot(ax=ax[index], label="actual", color='dodgerblue')
pred_full[column].plot(ax=ax[index], label='predicted', linestyle='--', color='orange')
ax[index].legend()
ax[index].set_title(f'{column}')
index+=1
plt.tight_layout()
plt.show()
Observations:
#Impulse-Response Function
varma_model = VARMAX(endog=train_set[columns], order=(1,1)).fit(disp=False)
gas_irfs = varma_model.impulse_responses(24, impulse=0, orthogonalized=True)
water_irfs = varma_model.impulse_responses(24, impulse=1, orthogonalized=True)
#Plot impulse reaction graph for better visualisation
fig, ax = plt.subplots(2, 1, figsize=(15,9))
index=0
irfs=[gas_irfs, water_irfs]
for col in columns:
ax[index].plot(range(0,25), irfs[index], marker='o', label=["Gas", "Water"])
ax[index].set_xlabel("Time Steps")
ax[index].set_ylabel("Impulse-Response Value")
ax[index].set_title(f"Impulse-Response Function - {col}")
ax[index].legend()
ax[index].grid(True)
index+=1
plt.tight_layout()
plt.show()
Observations:
#Perform durbin watson test
dw_results = durbin_watson_test(df=energyConsumption_df[columns], resid=varma_model.resid)
display(dw_results)
| Durbin Watson Test Statistic | |
|---|---|
| Gas Consumption (tons) | 2.06 |
| Water Consumption (tons) | 1.92 |
Observations:
#Diagnostics plot for Gas Consumption
fig=varma_model.plot_diagnostics(variable=0)
plt.gcf().suptitle('Gas Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(250, 557)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()
Observations for Gas Consumption:
#Diagnostics plot for Water Consumption
fig=varma_model.plot_diagnostics(variable=1)
plt.gcf().suptitle('Water Consumption - Diagnostics', fontsize=18)
ax = fig.axes[0]
ax.set_xlim(240, 555)
ax.set_ylim(-5,9)
ax.axhline(0)
plt.tight_layout()
plt.show()
Observations for Water Consumption:
#Try out p=1, q=1
columns = ['Gas Consumption (tons)', 'Water Consumption (tons)']
varma_model = varma_model = VARMAX(endog=train_set[columns], order=(1,1)).fit(disp=False)
#Summary
varma_model.summary()
| Dep. Variable: | ['Gas Consumption (tons)', 'Water Consumption (tons)'] | No. Observations: | 317 |
|---|---|---|---|
| Model: | VARMA(1,1) | Log Likelihood | -2744.862 |
| + intercept | AIC | 5515.725 | |
| Date: | Fri, 11 Aug 2023 | BIC | 5564.591 |
| Time: | 16:19:32 | HQIC | 5535.244 |
| Sample: | 01-01-1990 | ||
| - 05-01-2016 | |||
| Covariance Type: | opg |
| Ljung-Box (L1) (Q): | 0.43, 1.54 | Jarque-Bera (JB): | 2157.75, 59.97 |
|---|---|---|---|
| Prob(Q): | 0.51, 0.21 | Prob(JB): | 0.00, 0.00 |
| Heteroskedasticity (H): | 2.12, 1.39 | Skew: | 0.79, -0.41 |
| Prob(H) (two-sided): | 0.00, 0.09 | Kurtosis: | 15.68, 4.97 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 7.9184 | 2.261 | 3.503 | 0.000 | 3.487 | 12.349 |
| L1.Gas Consumption (tons) | 0.7724 | 0.067 | 11.583 | 0.000 | 0.642 | 0.903 |
| L1.Water Consumption (tons) | -0.0054 | 0.002 | -2.267 | 0.023 | -0.010 | -0.001 |
| L1.e(Gas Consumption (tons)) | -0.1743 | 0.069 | -2.545 | 0.011 | -0.309 | -0.040 |
| L1.e(Water Consumption (tons)) | 0.0097 | 0.004 | 2.746 | 0.006 | 0.003 | 0.017 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 133.9090 | 58.055 | 2.307 | 0.021 | 20.123 | 247.695 |
| L1.Gas Consumption (tons) | -1.5998 | 1.601 | -0.999 | 0.318 | -4.738 | 1.539 |
| L1.Water Consumption (tons) | 0.8022 | 0.065 | 12.250 | 0.000 | 0.674 | 0.931 |
| L1.e(Gas Consumption (tons)) | 2.9725 | 1.916 | 1.552 | 0.121 | -0.782 | 6.727 |
| L1.e(Water Consumption (tons)) | -0.3416 | 0.092 | -3.708 | 0.000 | -0.522 | -0.161 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| sqrt.var.Gas Consumption (tons) | 3.6825 | 0.096 | 38.265 | 0.000 | 3.494 | 3.871 |
| sqrt.cov.Gas Consumption (tons).Water Consumption (tons) | -51.7898 | 5.275 | -9.817 | 0.000 | -62.129 | -41.450 |
| sqrt.var.Water Consumption (tons) | 95.0667 | 3.322 | 28.615 | 0.000 | 88.555 | 101.578 |
Observations:
In summary, SARIMA works better for Electricity Consumption which clearly show seasonality in the data. ARIMA tends to perform better for Gas Consumption & Water Consumption which do not show seasonality in the data. VARMA work well when there is causality and cointegration between multiple variables but it will require additional data transformation as shown. Model improvement to find the most suitable order for each data will require more time and higher processing speed as larger number of iterations will be required to find the best order. Vector Error Correction Models (VECM) could be run if there was more time for variables that show long term relationship and no short term relationship between one another based on the Engle-Granger Cointegration Test results & Granger Causality Test.